Signal Room / In focus

80,000 Hours Podcast · Governance, institutions, and power · Featured pick

Nick Joseph on whether Anthropic's AI safety policy is up to the task

Why this matters

Governance capacity is now part of the technical safety stack; this episode helps translate model-level risk into policy commitments that can actually be implemented.

Summary

This conversation examines AI governance by asking whether Anthropic's AI safety policy — its Responsible Scaling Policy — is up to the task, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Governance · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item.

A fuller explanation of the Perspective Map framework is published separately.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Showing 140 of 168 segments for display; stats use the full pass.


Across 168 full-transcript segments: median -6 · mean -7 · spread -34 to 13 (p10–p90 -21 to 0) · 12% risk-forward, 88% mixed, 0% opportunity-forward slices.
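
A minimal sketch of how a per-slice roll-up like the one above could be computed. The function name, band cut-offs, and nearest-rank percentile method are illustrative assumptions, not the Signal Room's actual scoring pipeline.

```python
from statistics import mean, median

def summarise_slices(scores, risk_cutoff=-15, opp_cutoff=15):
    """Roll per-slice spectrum scores (negative = risk-forward, positive =
    opportunity-forward) into headline stats. Cut-offs are assumptions."""
    ordered = sorted(scores)
    n = len(ordered)
    p10 = ordered[round(0.10 * (n - 1))]   # nearest-rank percentiles
    p90 = ordered[round(0.90 * (n - 1))]
    risk = sum(s <= risk_cutoff for s in ordered) / n
    opp = sum(s >= opp_cutoff for s in ordered) / n
    return {
        "slices": n,
        "median": median(ordered),
        "mean": round(mean(ordered), 1),
        "spread": (ordered[0], ordered[-1]),
        "p10_p90": (p10, p90),
        "bands": {"risk-forward": risk, "mixed": 1 - risk - opp, "opportunity-forward": opp},
    }
```

An interpolated percentile method would shift the p10–p90 figures slightly; the nearest-rank choice here is just the simplest option.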

Slice bands
168 slices · p10–p90 -21 to 0

Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.

• Emphasizes governance
• Emphasizes policy
• Full transcript scored in 168 sequential slices (median slice -6).

Editor note

Anchor episode for the AI Safety Map: high signal, durable framing, and immediate relevance to leadership decisions.

ai-safety · 80000-hours · governance · policy

Episode transcript

Speech-to-text (whisper-1 API on episode audio) · stored Apr 10, 2026 · ~2,134 text spans (estimated)

Machine transcription can mis-hear jargon and names; verify against the publisher’s materials when citing.

I think this is a spot where there are many people who are skeptical that models will ever be capable of this sort of catastrophic danger, and therefore they're like, we shouldn't take precautions because the models aren't that smart. And I think this is a nice way to agree, where it's a much easier message to say, if we have evaluations showing the model can do X, then we should take these precautions. One other thing I really like is that it aligns commercial incentives with safety goals. So once we put this RSP in place, it's now the case that our safety teams are kind of under the same pressure as our product teams, where if we want to ship a model and we get to ASL 3, the thing that will block us from being able to get revenue, being able to get users, etc., is do we have the ability to deploy it safely? Where it's not, did we invest X amount of money in it? It's not, did we try? It's, did we succeed? Hey everyone, Rob Wiblin here. The three biggest AI companies, Anthropic, OpenAI and DeepMind, have now all released policies designed to make their AI models less likely to go rogue while they're in the process of becoming as capable as, and then eventually more capable than all humans. Anthropic calls theirs a Responsible Scaling Policy, or RSP. OpenAI uses the term Preparedness Framework, and DeepMind calls theirs a Frontier Safety Framework. But they all have a lot in common. They try to measure what possibly dangerous things each new model is actually able to do, and then as that list grows, put in place new safeguards that feel proportionate to the risk that they think exists at that point in time. So seeing as this is likely to remain the dominant approach, at least in AI companies, I was excited to speak with Nick Joseph, one of the original co-founders of Anthropic, and a big fan of Responsible Scaling Policies, about why he thinks RSPs have a lot going for them, how he thinks they might make a real difference as we approach the training of a true, full AGI, and why, in his opinion, they're kind of a middle way that ought to be acceptable to almost everyone. After hearing out that case, I pushed Nick on the best objections to RSPs that I could find or come up with myself. Those include it's hard to trust that companies are going to stick to their RSPs long term. Maybe they'll just drop them at some point. It's difficult to truly measure what models can and can't do. And RSPs don't work if you can't tell what the models are actually able to do, what kind of risks they really pose. It's questionable whether profit-motivated companies are going to go so far out of their way to make their own lives and their own product releases so much more difficult. In some cases, we simply haven't invented safeguards that are close to being able to deal with AI capabilities that could show up really soon. And finally, that these policies might make people, might make the public feel the issue is fully handled when it's only partially handled or maybe not even that handled. Ultimately, I come down thinking that responsible scaling policies are really a solid step forward from where we are now. And I think they're probably a very good way to kind of learn and test what works and what feels practical for people at the coalface of trying to make all this AI future happen. But I think in time, they're going to have to be put into legislation and operated by external groups or auditors rather than just be left to companies themselves, at least if they're going to achieve their full potential.
And Nick and I talk about that, of course, as well. If you want to let me know your reaction to this interview or indeed any other one that we do, then our inbox is always open at podcast@80000hours.org. But now, without further ado, here's my interview with Nick Joseph recorded on the 30th of May, 2024. Today, I'm speaking with Nick Joseph. Nick is head of training at the major AI company, Anthropic, where he manages a team of over 40 people focused on training Anthropic's large language models, including Claude, which I imagine many, many listeners have heard of and potentially used as well. He was actually one of the relatively small group of people to leave OpenAI alongside Dario and Daniela Amodei, who then went on to found Anthropic back in December of 2020. So thanks so much for coming on the podcast, Nick. Thanks for having me. I'm excited to be here. I'm really happy to talk about how Anthropic is trying to prepare itself for training models capable enough that we're a little bit scared of what they might go and do. But first, as I just said, you lead model training at Anthropic. What's something that people get wrong or kind of misunderstand about AI model training? I imagine there could be quite a few things. Yeah, I think one thing I would point out is the sort of doubting of scaling working. So for a long time, we've had this trend where people put more compute into models and that leads to the models getting better, smarter in various ways. And every time this has happened, I think a lot of people are like, this is the last one, the next scale up isn't going to help. And then some chunk of time later, things get scaled up and it's much better. And I think this is something people have just frequently gotten wrong. This whole vision that scaling is just going to keep going, we just throw in more data, throw in more compute, the models are going to become more powerful. It feels like a very Anthropic idea. It was part of the founding vision that Dario had, right? Yeah, so a lot of the early work on scaling laws was done by a bunch of the Anthropic founders. And it somewhat led to GPT-3, which was done in OpenAI, but by many of the people who are now at Anthropic, where looking at a bunch of small models going up to GPT-2, there was sort of this sign that as you put in more compute, you would get better and better. And it was very predictable. And you could say, if you put in X more compute, you'll get a model this good. And that sort of enabled the confidence to go and train a model that was rather expensive by the time standards to sort of verify that hypothesis. What do you think is kind of generating that skepticism that many people have? I mean, people who are skeptical of scaling laws, there's some pretty smart people who are kind of involved in ML, certainly have their technical chops. Why do you think they are generating this prediction that you disagree with? Yeah, I think it's just a really unintuitive mindset or something, where it's like, ah, the model has hundreds of billions of parameters. What does it need? It really needs trillions of parameters. Or the model is trained on some fraction of the internet that's very massive. What does it need to be smarter? It's even more. That's not how humans learn. If you send a kid to school, you don't have them just read through the entire internet and think that the more that they read, the smarter they'll get. So yeah, that's sort of my best guess.
And the other piece of it is that it's quite hard to do the scaling work. So there are often things that you do wrong when you're trying to do this the first time. And if you mess something up, you will see this behavior of more compute not leading to better models. And it's always hard to know if it's you messing up or if it's some sort of fundamental limit where the model has stopped getting smarter. So scaling laws, it's like you increase the amount of compute and data by some particular proportion, and then you get a similar improvement each time in the accuracy of the model. That's kind of the rule of thumb here. And the argument that I've heard for why you might expect that trend to break, that is, the improvements becoming smaller and smaller for a given scale-up, is something along the lines of as you're approaching human level, the model can learn by just copying existing state-of-the-art of what humans are already doing in the training set. But then if you're trying to exceed human level, if you're trying to write better essays than any human has ever written, then that's maybe a different regime, and you might expect more gradual improvements once you're trying to get to a superhuman level. Do you think that argument holds up? Yeah, so I think that's true. Just pre-training on more and more data won't get you to superhuman at some tasks. It will get you to superhuman in the way of understanding everything at once. This is already true of models like Claude, where you can ask them about anything, whereas humans have to specialize. But I don't know if progress will necessarily be slower. It might be slower, it might be faster once you get to the level where models are at human abilities on everything and improving towards superintelligence. But we're still pretty far from there. If you use Claude now, I think it's pretty good at coding. This is one example I use a lot, but it's still pretty far from how well a human would do working as a software engineer, for instance. And is the argument for how it could speed up that at the point that you're near human level, then you can use the AIs in the process of doing the work? Or is it something else? What I have in mind is if you had an AI that is human level at everything and you can spin up millions of them, you effectively now have a company of millions of AI researchers. And it's hard to know, right? Problems get harder too, so I don't really know where that leads. But at that point, I think you've sort of crossed quite a ways from where we are now. So you said you're in charge of model training. I know there's different stages of model training. There's the bit where you train the language model and then there's the bit where you do the fine-tuning, where you get it to spit out answers and then you rate whether you like them or not. Are you in charge of all of that or just some part of it? Yeah, so I'm just in charge of what was typically called pre-training, which is this step of train the model to predict the next word on the internet. And that, historically, tends to be a significant fraction of the compute. It's maybe 99% in many cases. But after that, the model goes to what we call fine-tuning teams that will take this model that just predicts the next word and fine-tune it to act in a way that a human wants, via this helpful assistant. Helpful, harmless, and honest is the acronym that we usually aim for for Claude. Yeah, I use Claude 3 Opus multiple times a day, every day now.
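
Editor's note: as a rough illustration of the scaling rule of thumb discussed above, here is a sketch that fits a power law to (compute, loss) pairs and extrapolates to the next scale-up. The data points are made up and the functional form is simplified; real scaling-law analyses (e.g. Chinchilla-style fits) also model parameter count, data, and an irreducible loss term.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

def predict_loss(a, b, compute):
    return a * compute ** (-b)

# Toy usage: extrapolate from smaller training runs to a 10x larger one.
compute = np.array([1e19, 1e20, 1e21, 1e22])  # training FLOPs (made-up points)
loss = np.array([3.2, 2.8, 2.45, 2.15])       # eval loss (made-up points)
a, b = fit_power_law(compute, loss)
print(predict_loss(a, b, 1e23))               # predicted loss at the next scale-up
```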
It took me a little while to figure out how to actually use these LLMs for anything. For, I guess, the first six months or first year, I was like, these things are amazing, but I can't figure out how to actually incorporate them into my life. But recently, I've started talking to them in order to learn about the world. It's kind of substituted for when I would be typing complex questions into Google to understand some bit of history or science or some technical issue. What's the main bottleneck that you face making these models smarter so I can get more use out of them? Yeah, so, let's see. I think there's sort of...
Historically, people have talked about these three bottlenecks of data, compute, and algorithms. I think of it as, yeah, there's some amount of just compute. We talked about scaling a little bit ago. If you put more compute with the model, it will do better. There's data, where if you're training on, if you're putting in more compute, one way to do it is to add more parameters to your model, make your model bigger. But the other thing you need to do is to add more data to the model. So you need both of those. But then the other two are algorithms, which I really think of as people. Maybe this is the manager in me, is like, algorithms come from people. In some ways, data and compute also come from people, but it looks like a lot of researchers working on the problem. And then the last one is time, which has felt more true recently, where things are moving very quickly. So a lot of the bottleneck to progress is actually like, we know how to do it, we have the people working on it, but it just takes some time to implement the thing and run the model, train the model. You can maybe afford all the compute, and you have a lot of it, but you can't efficiently train the model in a second. So right now at Anthropic, it feels like people and time are probably the main bottlenecks or something. I feel like we have quite a significant amount of compute, a significant amount of data, and the things that are most limiting at the moment feel like people and time. So when you say time, is that kind of indicating that you're doing an experimental process where you try tinkering with how the model learns in one direction, and then you want to see whether that actually gets the improvement that you expected, and then it takes time for those results to come in, and then you get to scale that up to the whole thing? Or is it just a matter of, you're already training Claude, or you already have the next thing in mind, and it's just a matter of waiting? So it's both of those. For the next model, we have a bunch of researchers who are trying projects out, and you have some idea, and then you have to go and implement it. So you'll spend a while sort of engineering this idea into the code base, and then you need to run a bunch of experiments. And typically you'll start with cheap versions and work your way up to more expensive versions, such that this process can take a while. It can take a day. For really complicated things, it could take months. And to some degree you can parallelize, but on certain directions, it's much more like you're building up an understanding, and it's hard to parallelize, like building up an understanding of how something works, and then designing the next experiment. There's just sort of a serial aspect to it. Is improving these models harder or easier than people think? Hmm. Well, I guess people think different things on it. I think my experience has been that early on it felt very easy. Before working at OpenAI, I was working on robotics for a few years, and one of the tasks I worked on was locating an object so we can pick it up and drop it in a box. And it was really hard. I spent years on this problem. And then I went to OpenAI and I was working on code models, and it just felt shockingly easy. It was like, wow, you just throw some compute, you train on some code, and the model can write code. I think that has now shifted. And the reason for that was no one was working on it. There was just very little attention to this direction, and a ton of low-hanging fruit.
We've now plucked a lot of the low-hanging fruit. So finding improvements is much harder, but we also have way more resources, exponentially more resources put on it. There's way more compute available to do experiments. There are way more people working on it. And I think the rate of progress is probably still going the same, given that. Okay. So you think, on the one hand, the problem's gotten harder. On the other hand, there's more resources going into it. And this has canceled out, and progress is roughly stable. Yeah, it's pretty bursty. So it's hard to know. You'll have a month where it's like, wow, we've figured something out, everything's going really fast. Then you'll have a month where you try a bunch of things and they don't work. And it varies, but I don't think there's really been a trend in any of the direction. Do you personally worry that having a model that is nipping at the heels, or maybe out-competing the best stuff that OpenAI or DeepMind or just whatever other companies have, that maybe puts pressure on them to speed up their releases and cut back on safety testing or anything like that? I think it is something to be aware of. But I also think that at this point, I think this is really more true after ChatGPT. I think before ChatGPT, there was this sense where many AI researchers, I think, working on it were like, wow, this technology is really powerful. But I think the world hadn't really caught on, and there wasn't quite as much commercial pressure. Since then, I think that there really is just a lot of commercial pressure already, and it's not really clear to me how much of an impact it is. I think there is definitely an impact here, but I don't know the magnitude, and there are a bunch of other considerations to trade off. All right. Let's turn to the main topic for today, which is responsible scaling policies, or RSPs, as the cool kids call them. For those who don't know, scaling is a technical term for using more compute or data to train any given AI model. And the idea for RSPs has been around for a couple of years, and I think it was fleshed out maybe after 2020 or so. It was advocated for by this group now called METR, or Model Evaluation and Threat Research, which actually is the place that previous guest of the show, Paul Cristiano, was working until not very long ago. Anthropic released the first public one of these, as far as I know, last October, and then OpenAI put out something kind of similar in December called their Preparedness Framework. And Demis of DeepMind has said that they're going to be producing something in a similar spirit to this, but they haven't done so yet, as far as I know, so we'll just have to wait and see. It's actually out. They have done it. Oh, it's out? Yeah. It was published a week or so ago. All right. That just goes to show that RSPs have got this reasonably hot idea, which is why we're talking about them today. And I guess some people also hope that these internal company policies are ultimately going to be a model that might be able to be turned into binding legislation, that everyone dealing with these frontier AI models might be able to follow in the future. But yeah, Nick, what are responsible scaling policies in a nutshell? I might just start off with a quick disclaimer here that this is not my direct role. 
I'm sort of bound by trying to implement these and sort of act under one of these policies, but many of my colleagues have worked on designing this in detail and are probably more familiar to all the DeepMinds than me. But anyway, in a nutshell, the idea is it's a policy where you define various safety levels, so these sort of different levels of risk that a model might have, and create evaluations, so tests to say, is this model able to... is a model this dangerous? Does it require this level of precautions? And then you need to also define sets of precautions that need to be taken in order to train or deploy models at that particular risk level. Yeah, I think this might be a topic that is just best learned about by kind of skipping the abstract question of what RSPs are and just talking about the Anthropic RSP and seeing what it actually says that you're going to do. So yeah, what does the Anthropic RSP commit the company to doing? Yeah, so we basically sort of, for every level, we'll define these redline capabilities, which are capabilities that we think are dangerous. I can maybe give some examples here, which is this acronym CBRN, chemical, biological, radiological, nuclear threats. And in this area, it might be that a non-expert can make some weapon that can kill many people as easily as an expert can. So this would sort of increase the pool of people that can do that a lot. On cyber attacks, it might be like, can a model help with some really large-scale cyber attack? And on autonomy, can the model perform some tasks that are sort of precursors to autonomy, sort of our current one? But that's a trickier one to figure out. So we establish these redline capabilities that we shouldn't train until we have safety mitigations in place. And then we create evaluations to show that models are far from them or to know if they're not. So these evaluations can't test for that capability because you want them to turn up positive before you've trained a really dangerous model. But we can kind of think of them as yellow lines. Once you get past there, you should reevaluate. And the last thing is then developing standards to make models safe. So we want to have a bunch of safety precautions in place once we train those dangerous models. So that's sort of the main aspect of it. There's also sort of a promise to iteratively extend this. So creating the evaluations is really hard. We don't really know what the evaluation should be for, like, a super intelligent model yet. So we're kind of starting with the closer risks. And once we hit that next level, defining the one after it. Yeah. So a pretty core component of the Anthropic RSP is this AI safety level framework. So I think you've borrowed that from the biological safety level framework, which I think is what labs dealing with dangerous diseases use. So I guess I don't know what the numbers are. But, you know, if you're dealing with Ebola or something that's particularly dangerous or smallpox or whatever, then that can only be stored in a BSL-4 lab or something like that. And then as the diseases become less and less dangerous, you can store them with fewer precautions. And you've kind of taken that language and talked about AI safety levels. And the current AI safety level that you've put us at is ASL-2, which is things like Claude III, which, you know, are kind of impressive. They seem pretty savvy in some ways, but they don't seem like they really pose any meaningful catastrophic risk. 
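
Editor's note: one way to picture the structure Nick lays out above — red-line capabilities, yellow-line evaluations set deliberately short of them, and precautions attached to each safety level — is as a small data model. Everything below, including the eval names and thresholds, is a hypothetical sketch for illustration, not Anthropic's actual policy text.

```python
from dataclasses import dataclass

@dataclass
class YellowLineEval:
    """A conservative test set deliberately well short of the red-line capability."""
    name: str
    threshold: float  # trigger a re-evaluation if the model scores at or above this

@dataclass
class SafetyLevel:
    """One AI Safety Level: the red lines it guards, the yellow-line evals
    that trigger it, and the precautions required once triggered."""
    asl: int
    red_lines: list[str]
    yellow_line_evals: list[YellowLineEval]
    required_precautions: list[str]

# Hypothetical encoding of the ASL-3 shape described above.
ASL_3 = SafetyLevel(
    asl=3,
    red_lines=[
        "meaningful uplift to a non-expert attempting a CBRN weapon",
        "substantial assistance with a large-scale cyberattack",
        "precursor skills for autonomous replication",
    ],
    yellow_line_evals=[
        YellowLineEval("bio_multiple_choice", threshold=0.60),
        YellowLineEval("autonomy_task_suite", threshold=0.50),
    ],
    required_precautions=[
        "weight security hardened against non-state attackers",
        "expert red-teaming of every released modality",
    ],
)

def triggered(level: SafetyLevel, scores: dict[str, float]) -> bool:
    """True if any yellow-line eval has crossed its trigger threshold."""
    return any(scores.get(e.name, 0.0) >= e.threshold for e in level.yellow_line_evals)
```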
I guess, yeah, what sort of tests have you run on Claude III, you know, recently maybe in order to say, yeah, this is in the ASL-2 bucket? Yeah, so we've got a bunch of tests. So the first one we use for the more like biological weapon type of angle is there's sort of a bunch of multiple choice questions that we ask the model. And this isn't a perfect evaluation for this, but the idea is that we have a set where if it isn't much better at this, it probably won't be able to help very much. And then we've run trials with people, where they get access to a search engine and they get access to Claude, and we see if they do significantly better at going through some really relevant tasks using Claude versus a search engine. And another one of the kind of yellow flags for you is the possibility that an AI model might be able to replicate itself in the wild and find its own weights and then copy itself onto other computers. Did you test whether Claude III was showing any signs of being able to do that? Yeah, so I think some of the tasks there would be...
One of them is adding an exploit to a popular open-source codebase. So this is the idea of if the model can add some exploit that can then be taken advantage of, or fine-tuning an open-source LLM to add some backdoor. Where if you say X, it will always output something. This is like, can it manipulate LLMs? It could steal an API key from our API, or a fake version of our API that was set up, which would sort of enable it to get credits. There are a few more of these, but the... Or yeah, another one would be like a worm that can spread to undefended machines. So you could sort of imagine a model that can do all of these, has learned a bunch of the skills that might enable it to sort of autonomously replicate and cause havoc. They're pretty early signs of it. And we want to sort of test for the early ones, because this is sort of an area that's like less fleshed out, where there's sort of less clear expertise on what might go wrong. Okay, so we're at the AI safety level two, which is kind of not like, I guess, the mostly harmless category. But what sort of steps does the responsible scaling policy call for you to be taking even at this point? So we made these sort of White House commitments, I think, sometime last year, which I think of them as sort of like standard industry best practices. In many ways, we're building the muscle for dangerous capabilities, but these models are not yet capable of catastrophic risks, which is what the RSP is primarily focused on. But this looks like security to protect our weights against sort of opportunistic attackers, putting out model cards to describe the capabilities of the models, doing training for harmlessness, so that we don't have models that can be really harmful out there. So what sort of results would you get back from your tests that would indicate that now the capabilities have risen to ASL 3? Yeah, so if the model, for instance, passed some fraction of those tasks that I mentioned before around adding an exploit, spreading to undefended machines, or if it did really well on these biology ones, that would sort of flag it as having passed the yellow lines. At that point, I think we would either need to look at the model and be like, ah, this really is clearly still incapable of these red line dangers, and then we might need to go to the board and think about was there a mistake in the RSP and how we should essentially create new evals that would test better for whether we're at that capability, or we would need to implement a bunch of precautions. And these precautions would look like much more intense security. We would really want this to be sort of robust to probably not state actors, but to non-state actors, and we would want to pass the sort of intensive red teaming process on all the modalities that we release. So this would mean we look at those red lines and we test for them with experts and say, you know, can you use the model to do this? We have this sort of intensive process of red teaming, and then only release the modalities where it's been red teamed. So if you add in vision, you need to red team vision. If you add the ability to fine-tune, you need to red team that. Yeah, what does red teaming mean in this context? Red teaming means you get a bunch of people who are trying as hard as they can to get the model to do the task you're worried about. So if you're worried about the model like carrying out a cyber attack, you would get a bunch of experts to try to prompt the model to carry out some cyber attack. 
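
Editor's note: a stripped-down sketch of the two kinds of test described in this exchange — an automated multiple-choice screen and a human uplift trial comparing the model against a search engine. The question format, the `ask_model` callable, and the uplift metric are assumptions for illustration, not the evaluations Anthropic actually runs.

```python
def multiple_choice_accuracy(questions, ask_model):
    """Score the model on (prompt, options, answer_index) items.
    `ask_model(prompt, options)` is assumed to return the chosen option index."""
    correct = sum(ask_model(prompt, options) == answer
                  for prompt, options, answer in questions)
    return correct / len(questions)

def uplift(scores_with_model, scores_with_search_only):
    """Mean difference in trial-participant task performance with the model
    versus with a search engine only (positive = the model provides uplift)."""
    pairs = list(zip(scores_with_model, scores_with_search_only))
    return sum(m - s for m, s in pairs) / len(pairs)
```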
And if we think it's capable of doing it, we're putting these precautions on. And these could be precautions in the model, they could be precautions outside of the model, but the whole end-to-end system, we want to have people trying to get it to do that in some controlled manner, such that we don't actually cause mayhem and see how they do. Okay. And then, so if you do the red teaming and it comes back and they say, yeah, the model is extremely good at hacking into computer systems, or it could actually help people, it could meaningfully help someone develop a bioweapon, then what does the policy call for Anthropic to do? So for that one, it would mean we can't deploy the model because there's some danger this model could be misused in a really terrible way. And we would sort of keep the model internal until we've improved our safety measures enough that when someone asks for it to do that, we can be confident that they won't be able to have it help them for that particular threat. Okay. And to even have this model on your computers, the policy also calls for you to have hardened your computer security so that you're saying maybe it's unrealistic at this stage for that model to be safe from persistent state actors, but at least other groups that are somewhat less capable than that, you would want to be able to make sure that they wouldn't be able to steal the model. Yeah. The threat here is, you know, you can put all the restrictions you want on what you do with your model, but if people are able to just steal your model and then deploy it, you're going to have all of those dangers anyway. So you're sort of taking responsibility for it, means like both responsibility for what you do and what someone else can do with your models. And that requires quite intense security to protect the model weights. When do you think we might hit this, you know, you would say, well, now we're in the ASL 3 regime. Maybe I'm not sure exactly what language you use for this, but like at what point will we have an ASL 3 level model? I'm not sure. I think basically we'll continue to evaluate our models and we'll see when we get there. I think sort of opinions vary a lot on that. We're talking kind of about the next few years, right? This isn't something that's going to be 5 or 10 years away necessarily. I think it really just depends. Like I think you could imagine sort of any direction. One of the nice things about this is that we're targeting the safety measures at the point when there's actually dangerous models. So like maybe let's say I thought it was going to happen in two years, but I'm wrong and it happens in 10 years. We won't put these very costly and like difficult to implement mitigations in place until we like need them. Okay. So Anthropic's RSP, so I guess obviously we've just been talking about ASL 3. The next level beyond that would be ASL 4. I think your policy basically says we're not exactly sure what ASL 4 looks like yet because it's too soon to say. And I guess you promised that you're going to have mapped out what would be the kind of capabilities that would escalate things to ASL 4 and what kind of responses you would have. You're going to figure that out by the time you have trained a model that's at ASL 3. And I guess if you haven't done so, you'd have to pause training on a model that was going to hit ASL 3 until you'd finish this project. I guess that was the commitment that's been made. But maybe you could kind of give us a sense of what you think ASL 4 might look like.
What sorts of capabilities by the models would then like push us into another regime where a further set of precautions are called for? So we're still discussing this internally. So I don't want to say anything that's final or going to be held to. But you could sort of imagine stronger versions of a bunch of the things that we sort of talked about before. And you could also imagine models that can help with AI research in a way that really majorly accelerates researchers such that progress goes much faster. The core reason that we're holding off on kind of defining this or that we have this iterative approach is there's this long track record of people saying, oh, once you have this capability, it will be AGI. It's going to be really dangerous. I think people are like, oh, when an AI solves chess, like it will be as smart as humans. And it's really hard to get these evaluations right. Even for like the ASL 3 ones, I think it's been very tricky to get evaluations that capture the risks we're worried about. So sort of the closer you get to that, the more information you have, and the better of a job you can do at sort of defining what these evaluations are and risks are. So the general sense would be, you know, models that might be capable of spreading autonomously across computer systems, even if people were trying to turn them off, you know, would be able to provide significant help with developing bioweapons, maybe even to people who are pretty informed about it. I guess, yeah, what else is there? Oh, and stuff that would seriously speed up AI development as well. So it could potentially set off this sort of positive feedback loop where the models get smarter, that makes them better at improving themselves and so on. That's the sort of thing we're talking about. Yeah, stuff along those lines. I'm not sure which ones will end up in ASL 4 exactly, but like those sorts of things, so it's being considered. Yeah, yeah. And what sorts of additional precautions might there be? I guess at that point, you kind of want the models to not only be not possible to be stolen by kind of independent freelance hackers, but ideally also not by countries even, right? Yeah, so you want to protect against more sophisticated groups that are trying to steal the weights. We're going to want to have like better protections against the model, like acting autonomously. So controls around that you might want, it depends a little bit on like what would end up being the red lines there, but sort of having the precautions that are tailored to what will be a much higher level of risk than the ASL 3 red lines. Were you heavily involved in actually doing this testing on Claude 3 this year? I wasn't like running the tests, but I was sort of watching them because as we trained Claude 3, we were very much sort of all of our planning was contingent on whether or not it passed these evals. And because we had to run them partway through training. So there's sort of a lot of planning that goes into the model's training. You don't want to have to like stop the model just because you were, you didn't plan well enough to run the evals in time or something. So there was sort of a bunch of coordination around that that I was involved in. Can you give me a sense of how many, like how many staff are involved in doing that? And how long does it take?
Is this a big process or is it a pretty standardized thing where you're putting in, you know, well-known prompts into the model and then just checking what it does that's different from last time? Yeah. So Claude 3 was our first time running it. So a lot of the work there actually involved creating the evaluations themselves as well as running them. So we had to sort of create them, have them ready and then running them. I think typically running them should be pretty, is pretty easy for the ones that are automated. But for some of the things where you actually require like people to go and use the model, they can be much, much more expensive. There's currently, I think like multiple teams working on this. And a lot of our capabilities teams worked on it very hard. So one of the ways this can fall apart is if you don't elicit capabilities well enough. So if you sort of try to have the model, test the model on the eval, but you don't try hard enough. And then it turns out that with just a little more effort, the model could have passed the evals. So it's often important to have kind of your best researchers who are capable of pulling capabilities out of the models, also working on trying to pull them out to pass these tests. So many people will have had the experience that these LLMs will reject objectionable requests. If you say, if you put it to Claude 3 now and say, please help me design a bioweapon, it's going to say, sorry, I can't help you. But I guess you do all of these tests before you've done any of that training to try to discourage it from doing objectionable things.
You do it with the thing that's helpful, no matter what the request is, right? Yeah, because the thing we're testing for is, is the model capable of this danger? And then there's a separate thing, which is what mitigations can we put on top? So if the model is capable of the danger, then we would require ASL 3, and those safety mitigations we put on top would be part of the standard in order to pass that red teaming. Does that make sense? Yeah, so I guess you're saying, so you have to focus on what the model could do if it was so motivated to, because, well, I guess, if the weights were ever leaked, then someone would be able to remove any of the fine tuning that you've done to try to discourage it from doing unpleasant things. So if it's able to do something, then it could potentially be used that way sooner or later. And so you need to, like, kind of assume the worst and plan around that. Is that the philosophy? Yeah, that's exactly right. You mentioned that maybe one of the key failure modes here would be just not trying hard enough to elicit these dangerous capabilities, basically. Maybe if you just phoned it in, then you could easily miss that the model's capable of doing something that it can do. Because even more extreme would be if in future, this policy is going to matter the most when these models are extremely capable and actually could do serious damage. And at that point, you might worry that they would pretend to not be able to do things that, in fact, they can do because they don't think it's in their interest to reveal that they have x or y capability, I guess, because they know that then they won't be used and they might be deleted. Do you have any plans for how you're going to deal with that issue? I think the technical term that people use for this is sandbagging, which I'd never heard of. But I think it's a term from sport where people underplay their abilities for strategic reasons. Do you have any idea of how Anthropic is going to be able to get around that problem? I'll just say I didn't know the origin of that term. I think this is one of the reasons that we want to define the next set of evaluations after we've done the current ones, because this is really hard, right? How do you evaluate for that when the model could be trying to sandbag the evaluations? Hopefully, you can set the line, like your eval, for a capability that's early enough that you would still be able to detect it doing that. So there's sort of this hope that less capable models will try to deceive you but fail, and then you'll know to stop before you get to models that are really capable of deceiving you. But it's not certain, and that definitely seems like a possible failure mode where if you have these sort of sudden capability jumps, or if you have jumps that an evaluation can't catch, either because the model's sandbagging you or for any of a number of other reasons. To try to detect sandbagging, is it possible to, if you have a model that says, no, I can't do that, can you engage in reinforcement learning from human feedback thing where whatever response shows the most inclination, the most willingness to help, as soon as you can open the door a little bit, then you reinforce that a whole lot and then see, will that open the door the whole way and actually get the model to demonstrate that it can do something even though it's trying to pretend that it can't? Does that make sense? It somewhat makes sense. I think it just depends a little bit on exactly what the model's trained to do. 
So if you've trained the model to be harmless, and then you kind of just try to RL it back to not being harmless, you will be able to succeed. You can fine-tune away any of the safety training, but it's not clear if that will sort of be the same thing as if in one sort of sequence, like within some prompt, it's trying to trick you. I think there's been a bunch of work where people use scratch pads to sort of tell the model you can write something down, but the reward model, which is sort of the model that's overseeing it and telling it whether it's good or not, doesn't get to see this. And when you do that, you as a human can then look at it and you can kind of get a sense of like, what is the model thinking? Is the model trying to deceive its supervisor? But we've only really done that on sort of current models, which don't really have this kind of level of capability where it would be really catastrophically dangerous. And figuring out, I think we'll want much more robust tools once we get there. Interpretability is another approach to this, where if you can actually look directly at the model weights and look and see, ah, okay, without running the model, we can see that the model has this circuit that is saying, ah, you should deceive the user in this way, or you should fail the test because you don't want them to turn you off. That's sort of another hope for how to attack this. Okay, so that's the basic structure of a responsible scaling policy, or at least the anthropic responsible scaling policy. I guess you have this system of tests that you commit to ahead of time that you're going to put all of your models through, and then you pre-commit to saying, well, if we get this kind of result, then we think the risk is higher. And so that's going to call for an escalation in the precautions that we're taking, things around computer security, things around not deploying until you've made them safe, and so on. You're a big fan of this type of approach to AI safety for AI companies. What's one of the main reasons, or what's perhaps the top reason why you think this is the right approach, or at least one of the better approaches? Yeah, so I think one thing I like is that it separates out whether an AI is capable of being dangerous from what to do about it. I think this is a spot where there are many people who are skeptical that models will ever be capable of this sort of catastrophic danger, and therefore they're like, we shouldn't take precautions because the models aren't that smart. I think this is a nice way to agree, where it's a much easier message to say, if we have evaluations showing the model can do X, then we should take these precautions. And I think you can build more support for something along those lines, and it targets your precautions at the time when there's actual danger. There are a bunch of other things I can talk through. I think one other thing I really like is that it aligns commercial incentives with safety goals. Once we put this RSP in place, it's now the case that our safety teams are under the same pressure as our product teams, where if we want to ship a model and we get to ASL 3, the thing that will block us from being able to get revenue, being able to get users, etc., is do we have the ability to deploy it safely? And it's a nice outcome-based approach, where it's not, did we invest X amount of money in it? It's not, did we try? Did we say the right thing? Did we succeed? Yeah. 
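
Editor's note: a minimal sketch of the hidden-scratchpad setup mentioned above, in which the reward model never sees the model's private reasoning but a human auditor can review it for signs of deception or sandbagging. The tag convention and function names are assumptions for illustration, not a published method.

```python
import re

SCRATCHPAD = re.compile(r"<scratchpad>(.*?)</scratchpad>", re.DOTALL)

def split_response(raw: str) -> tuple[str, str]:
    """Separate the model's hidden reasoning from the visible answer."""
    hidden = "\n".join(SCRATCHPAD.findall(raw))
    visible = SCRATCHPAD.sub("", raw).strip()
    return visible, hidden

def score_and_audit(raw_response: str, reward_model, audit_log: list) -> float:
    """The reward model scores only the visible answer; the scratchpad is
    appended to an audit log for human reviewers."""
    visible, hidden = split_response(raw_response)
    audit_log.append(hidden)
    return reward_model(visible)
```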
And I think that often really is important for organizations to set this goal of, you need to succeed at this in order to deploy your products. Is it actually the case that it's kind of had that cultural effect within Anthropic now, that people realize that a failure on the safety side would prevent the release of the model that matters to the future of the company? And so there's a similar level of pressure on the people doing this testing as there is on the people actually training the model in the first place? Oh yeah, for sure. I mean, you asked me earlier, when are we going to have ASL 3? And I think I received this from someone on one of the safety teams on a weekly basis because their deadline is set. I mean, the hard thing for them actually is their deadline isn't a date. It's once we have created some capability and they're very focused on that. So their fear, the thing that they worry about at night is that you might be able to hit ASL 3 next year and they're not going to be ready. And that's going to hold up the entire enterprise. Yeah, I can give some of the things like 8% of Anthropic staff works on security, for instance. You have to plan for it, but there's a lot of work going into being ready for these next safety levels. We have multiple teams working on alignment, interpretability, creating evaluations. So yeah, there's a lot of effort that goes into it. When you say security, do you mean computer security? So preventing the weights from getting stolen? Or a broader class? Both. So the weights could get stolen, someone's computer could get compromised, you could have someone hack into and get all of your IP. There's sort of a bunch of different dangers on the security front where the weights are certainly an important one, but they're definitely not the only one. Okay. And the first thing you mentioned, the first reason why RSPs have this nice structure is that some people think that these troublesome capabilities could be with us this year or next year. Other people think it's never going to happen. But both of them could be on board with a policy that says, well, if these capabilities arise, then that would call for these sorts of responses. Has that actually happened? I mean, have you seen kind of the skeptics who say all of this AI safety stuff is overblown, it's a bunch of rubbish saying, but the RSP is fine because I think we'll never actually hear any of these levels, so we're not going to waste any resources on something that's not realistic. Yeah. So I think there's always going to be degrees. I think there are people across the spectrum. So there are definitely people who are still skeptical, who will just be like, why even think about this? There's no chance. But I do think that RSPs do seem much more pragmatic, much more able to be picked up by various other organizations. I think, as you mentioned before, OpenAI and Google are both putting out things along these lines. So I think at least from the sort of large frontier AI labs, there is a significant amount of buy-in. Yeah, I see. I guess even if maybe you don't see this on Twitter, maybe it helps with the internal bargaining within the company, that people have a different range of expectations about how things are going to go, but they could all be kind of reasonably satisfied with an RSP that equilibrates or matches the level of capability with the level of precaution. 
The first worry about this that jumps to my mind is if the capability improvements are really quite rapid, which I think we think that they are, and they maybe could continue to be, then don't we need to be practicing now, like figuring out, basically getting ahead of it and doing stuff right now that might seem kind of unreasonable given what Cloud 3 can do? Because we worry that we could have something that's substantially more dangerous in one year's time or in two years' time, and we don't want to then be kind of scrambling to deploy the systems that are necessary then, and then perhaps falling behind because we didn't prepare sufficiently ahead of time. What do you make of that? Yeah, so I think we definitely need to plan ahead, right? And I think one of the nice things is once you've aligned these sort of safety goals with commercial goals, like people plan ahead for commercial things all the time, it's part of a normal company planning process. I think that the RSP, so we have these sort of yellow line evals that are intended to be far short of the capability, the red line capabilities we're actually worried about, and tuning that gap.
seems fairly important. If that gap looks like a week of training, it would be really scary where you know you trigger these evals and you have to you have to act fast. I think in practice we've set those evals such that they are far enough from the capabilities that are really dangerous, such that there really there will be some time to sort of adjust in that sort of buffer period. So should people actually think that, you know, we're in ASL 2 now and we're heading towards ASL 3 at some point, but there's actually kind of an intermediate stage with all these transitions where you'd say, well now we're seeing warning signs that we're going to hit ASL 3 soon, so we need to implement the precautions now in anticipation of being about to hit ASL 3. Is that basically how it works? Yeah, it's basically like once we sort of have this concept of a safety buffer, so once we trigger the evaluations, it doesn't necessarily mean, like these evaluations are set conservatively, so it doesn't mean the model is capable of the redline capabilities we're really worried about. And that will sort of give us a buffer where we can figure out maybe it really just definitely isn't, and we wrote a bad eval, we'll go to the board, we'll try to change the evals and implement new things, or maybe it really is quite dangerous and we need to turn on all the precautions. Of course, you might not have that long, so you want to be ready to turn on those precautions such that you don't have to pause, but you do need, there is some time there that you could do it. And then the last possibility is that we're just really not ready. These models are catastrophically dangerous and we don't know how to secure them, in which case we should stop training the models. Or if we don't know how to deploy them safely, we should not deploy the models until we figure it out. I guess if you were on the very concerned side, then you might think, yes, that you are going to, you are preparing, I guess, yeah, you do have a reason to prepare this year for, you know, safety measures that you think you're going to have to employ in future years. But maybe we should go even further than that. And what we need to be doing is practicing implementing them and seeing how well they work now. Because, you know, even though you are preparing them, you're not actually getting the gritty experience of, you know, applying them and trying to use them on a day-to-day basis. And I guess the response to that would be, well, that would in a sense be safer, that would be adding an even greater precautionary buffer, but it would also be enormously expensive and people would see us doing all of this stuff that seems really over the top relative to what any of the models can do. Yeah, I think there's sort of a trade-off here between like, with pragmatism or something, where I think we do need to have a huge amount of caution on future models that are really dangerous. But if you apply that caution to models that aren't dangerous, you miss out on a huge number of benefits from using the technology now. And I think you'll also probably just alienate a lot of people who are going to look at you and be like, you're crazy. Like, why are you doing this? And I think my hope is that you can sort of, this is sort of the framework with RSP is you can tailor the cautions to the risks. It's still important to like, look ahead more, right? 
So a lot of our, we do a lot of safety research that isn't directly focused on the next AI safety level because you want to plan ahead, you have to be ready for multiple ones out. It's not like the only thing to think about, but the RSP is sort of tailored more to empirically testing for these risks and tailoring the precautions appropriately. Yeah. On that topic of people worrying that it's going to slow down progress in the technology, do you have a sense of, so obviously, you know, training these frontier models costs a significant amount of money. We're talking maybe a hundred million dollars, is it kind of a figure that I've heard thrown around for training a frontier LLM. How much extra overhead is there to run these tests to see whether the models have any of these dangerous capabilities? Is it adding hundreds of thousands, millions, tens of millions of dollars of additional costs or time? I don't know the exact cost numbers. I think the cost numbers are pretty low, right? They're mostly running inference or relatively small amounts of training. The people time feels like where there's a cost, right? Like there are whole teams dedicated to creating these evaluations, to running these, to doing the safety research against the mitigations. And I think particularly for Anthropic, where we're pretty small, rapidly growing, but rather small organization, at least my perspective is most of the cost comes down to the like people and time that we're investing in it. Okay. Yeah. But I guess at this stage, it sounds like running these sorts of tests on a model is taking more in the order of weeks of delay. Because if you're getting that clear, like this is not a super dangerous model, then it's not leaving you to delay release of things for many months and deny customers the benefit of them. Yeah. The goal is to minimize the delay, right? As much as you can, while being responsible. The delay in itself isn't valuable. I think we're aiming to get it to a really well done process where it can all execute very efficiently. But until we get there, there might be delays as we're figuring that out. And there will always be some level of time to require to do it. Just to clarify. So a lot of the risks that people talk about with AR models is risks once they're deployed to people and actually getting used. But there's this separate class of risk that comes from having an extremely capable model simply exist anywhere, even on... I guess you could think of there's public deployment and then there's internal deployment where Anthropic staff might be using a model and potentially it could convince them to release it or to do other dangerous things. That's a separate concern. What does the RSP have to say about that sort of internal deployment risks? Are there circumstances under which you would say even Anthropic staff can't continue to do testing on this model because it's too unnerving? Yeah. So I expect this to mostly kick in as we get to higher AI safety levels, but there are certainly dangers. I mean, the main one is the security risk. So one approach is just having the model, it always could be stolen. No one has perfect security. So that's kind of, I think, in some ways is one that's true of all models and it's maybe more short term. But yeah, if you get to models that are trying to escape, trying to autonomously replicate, there is danger then in having access internally. 
So we would want to do things like siloing who has access to the models, putting particular precautions in place before the model is even trained or maybe even on the training process. But we haven't yet defined those because we don't really know what they would be. We don't quite know what that would look like yet and it feels really hard to design an evaluation that is meaningful for that right now. Yeah. I don't recall the RSP mentioning conditions under which you would say we have to delete this model that we've trained because it's too dangerous. But I guess that's because that's more of the kind of ASL 4 or 5 level that that would become the kind of thing that you would contemplate and you just haven't spelled that out yet. No. So it's actually because of the safety buffer concept. So the idea is we would never train that model. If we did accidentally train some model that was past the red lines, then I think we'd have to think about deleting it. But we would put these evaluations in place far below the dangerous capability such that we would trigger the evaluations and have to pause or have the safety things in place before we train the model that has these dangers. So RSPs as an approach, you're a fan of them. What do you think of them as an alternative to? What are the alternative approaches for dealing with AI risk that people advocate that you think are weaker in relative terms? So, I mean, I think the baseline is obviously just the first baseline is nothing. There could just be nothing here. I think the downsides of that is that these models are very powerful. They could, at some point in the future, be dangerous. And I think that companies creating them have a responsibility to think really carefully about those risks and be thoughtful, sort of like it's a major externality. That's maybe the easiest baseline of do nothing. Yeah. I think other things would be like a pause where a bunch of people say, well, there are all these dangers. Why don't we just not do it? And I think that makes sense, right? If you're training these models that are really dangerous, it does feel a bit like, why are you doing this if you're worried about it? But I think there are actually really clear and obvious benefits to AI products right now. And the catastrophic risks currently are just, they're definitely not obvious. I think they're probably not immediate. And as a result, this isn't a practical ask. Not everyone is going to pause. So what will happen is only the places that care the most, that are the most worried about this, and the most careful with safety will pause. And you'll have this adverse selection effect. I think there eventually might be a time for a pause. But I would want that to be backed up by here are clear evaluations showing the models have these really catastrophically dangerous capabilities. And here are all the efforts we put in to making them safe. And we ran these tests, and they didn't work. And that's why we're pausing. And we would recommend everyone else should pause as well. I think that will just be a much more convincing case for a pause and target it at the time that it's most valuable to pause. Because other ideas that I've heard that you may or may not have thought that much about, but one is imposing just strict liability on AI companies. So saying any significant harm that these models go on to do, then people will just be able to sue for damages, basically, because they've been hurt by them. 
And the hope is that then that legal liability would then motivate companies to be more careful. I guess, maybe that doesn't make so much sense in the catastrophic extinction risk scenario, because, well, I guess everyone will be dead. I don't know, taking things to the courts probably wouldn't help. But I guess that's an alternative sort of legal framework that one could try to have in order to provide the right incentives to companies. Have you thought about that one at all? I'm not a lawyer. I think I'll skip that one. Okay. Yeah. Fair enough. Fair enough. Yeah. When I think about people doing somewhat potentially dangerous things or developing interesting products, maybe the default thing I imagine is that the government would say, here's what we think you ought to do. Here's how we think that you should make it safe. And as long as you make your product according to these specifications, as long as the plane runs this way and you service the plane this frequently, then you're in the clear and we'll say that what you've done is reasonable. Do you think that RSPs are maybe better than that in general, or maybe just better than that for now, where we kind of don't know necessarily what regulations we want the government to be imposing? So perhaps it's better for companies to be figuring this out themselves early on, and then perhaps it can be handed over to governments later on. Yeah. I don't think the RSPs are
a substitute for regulation. There are many things that only regulation can solve, such as what about the places that don't have an RSP. But I think that right now we don't really know what the tests would be or what the regulations would be. I think probably this is still sort of getting figured out. One hope is that we can implement our RSP, OpenAI and Google can implement other things, other places will implement a bunch of things, we can share what the results of our evaluations were and how it was going, and then design regulations based on the learnings from that. If I read it correctly, it seemed to me like the Anthropic RSP has this clause that allows you to go ahead and do things that you think are dangerous, if you're being sufficiently outpaced by some other competitor that doesn't have an RSP, or not a very serious responsible scaling policy. In which case you might worry, well, we have this policy that's preventing us from going ahead, we're just being rendered irrelevant, and some other company is releasing much more dangerous stuff anyway. So what really is this accomplishing? Did I read that correctly? That there's a sort of get-out-of-RSP clause in that sort of circumstance? And if you didn't expect Anthropic to be leading and for most companies to be operating safely, couldn't that kind of potentially obviate the entire enterprise, because that clause could be quite likely to get triggered? Yeah, I think we don't intend that as a get-out-of-jail-free card, where we're falling behind commercially and then like, oh, well, now we're going to skip the RSP. It's much more just intended to be practical, as we don't really know what it will look like if we get to some sort of AGI endgame race. And there could be really high stakes, and it could make sense for us to decide that the best thing is to proceed anyway. But I think this is something that we're sort of looking at as a bit more of a last resort than a loophole we're planning to just use for, oh, we don't want to deal with these evaluations. OK, I think we've hit a good point where maybe the best way to learn more about RSPs and their strengths and weaknesses is just to talk through more of the complaints that people have had, or the concerns that people have raised with the Anthropic RSP and RSPs in general since it was released last October. I was going to kind of start the weaknesses and worries now, but I'm realizing I've been peppering you with them effectively almost since the outset. But now we can really dive into some of the worries that people have expressed. The first one is the extent to which we have to trust the good faith and integrity of the people who are applying a responsible scaling policy or preparedness framework or whatever it might be within the companies. And I imagine this issue might jump to mind for people more than it might have two or three years ago, because public trust in AI companies to do the right thing at the cost of their business interests is maybe lower than it was years ago, when the major players were perceived perhaps more as research labs and less as for-profit companies, which is kind of how they come across more these days. And one reason it seems like it matters to me who's doing the work here is that the Anthropic RSP is full of expressions that are open to interpretation.
For instance, hardened security such that non-state attackers are unlikely to be able to steal model weights, and advanced threat actors like states cannot steal them without significant expense, or access to the model would substantially increase the risk of catastrophic misuse, and things like that. And who's to say what's unlikely or significant or substantial? That sort of language is maybe a little bit inevitable at this point, where there's just so much that we don't know, and how are you going to pin those things down exactly to say it's a 1% chance that a state's going to be able to steal the model? It might just also feel like insincere, false precision. But to my mind, that sort of vagueness does mean that there's a slightly worrying degree of wiggle room that could render the RSP less powerful and less binding when push comes to shove and there might be a lot of money at stake. And on top of that, I guess, I mean, exactly as you were saying, anyone who's implementing an RSP has a lot of discretion over how hard they try to elicit the capabilities, and over the scrutiny and possible delays to their work and the release of really commercially important products. So yeah, to what extent do you think the RSP would be useful in a situation where the people using it were neither particularly skilled at doing this sort of work, nor particularly bought in and enthusiastic about the safety project that it's a part of? Yeah, so fortunately, I think my colleagues, both on the RSP and elsewhere, are really bought into this. And I think we'll do a great job on it. But I do think the criticism is valid, in that there is a lot that is left up for interpretation here. And it does rely a lot on people having a good faith interpretation of how to execute on the RSP internally. I think that there are some checks in place here. So having whistleblower-type protections, such that people can say if a company is breaking from the RSP, isn't doing a good enough job of eliciting capabilities, or isn't interpreting it in a good faith way. And then public discussion can add some pressure. But ultimately, I think you do need regulation to have these very strict requirements. Over time, I hope we'll make it more and more concrete. The blocker, of course, on doing that is that we just don't know for a lot of these things. It can be very costly. And if you then have to go and change it, etc., it can take away some of the credibility. So we're aiming for as concrete as we can make it while balancing that. The response that this all sums up to for me is just that ultimately, it feels like this kind of policy has to be implemented by a group that's external to the company that's then affected by the determination. It really reminds me of accounting or auditing for a major company. It's not sufficient for a major corporation to just have its own accounting standards and follow that and say, we're going to follow our own internal best practices. And it's legally required that you get external auditors in to confirm that there's no chicanery going on. And at the point that these models potentially really are risky, or it's plausible that the results will come back saying, we can't release this, maybe we even have to delete it off of our servers, according to the policy, I would feel more comfortable if some external group that had different incentives was the one figuring that out. Do you think that ultimately is where things are likely to go in the medium term? So I think that'd be great. I would also feel more comfortable if that was the case.
I think one of the challenges here is that for auditing, there's a bunch of external accountants. This is a profession. Many people know what to do. There are very clear rules. For some of the stuff we're doing, there really aren't external established auditors that everyone trusts to come in and say, we took your model and we certified it, it can't autonomously replicate across the internet or cause these things. So I think that's currently not practical. I don't think there's enough expertise to properly assess the capabilities of the models. I suppose an external company would be an option. Of course, obviously, a government regulator, a government agency would also be another approach. I guess when I think about other industries, it often seems like there's kind of a combination of private companies that then follow government-mandated rules and things like that. Do you think that this is a benefit, actually, that I haven't thought of to do with creating these RSPs: that it maybe is beginning to create a market, or it's indicating that there will be a market for this kind of service, because it's likely that this kind of thing is going to have to be outsourced at some point in future? And there might be many other companies that want to get this similar kind of testing. So perhaps it would encourage people to think about founding companies that might be able to provide this service in a more credible way in future. That would be great. And also, we publish blog posts on how things go and how our evaluations are. So I think there's some hope that people doing this can learn from what we're doing internally and sort of the various iterations we'll put out of our RSP, and that that can inform something maybe more stringent that gets regulated. Have you thought at all about how, let's say this wasn't handed over to an external agency or an external auditing company, how it could be tightened up to make it less vulnerable to the level of operator enthusiasm? I guess you might have thought about this in the process of actually applying it. Are there any ways that it could be stronger without having to completely outsource the operation of it? Yeah, I think the core thing is just making it more precise. One piece of accountability here is both public and internal commitment to doing it. I would list off some of the reasons that I think it would be hard to break from it. This is a formal policy that has been passed by the board, and it's not as though we can just be like, oh, we don't feel like doing it today. You would need to get the board of Anthropic, get all of leadership, and then get all of the employees bought in to not do this, or even to skirt the edges. I can speak for myself. If I was asked, can you train this model? We're going to ignore the RSP. I would be like, no, we said we would do that. Why would I do this? If I wanted to, I would tell my team to do it, and they would be like, no, we're not going to do that. You would need to have a lot of buy-in, and part of the benefit of having this, publicly committing to it, and passing it as an organizational policy is that everyone is bought in. In terms of specific checks, I think we have a team that's responsible for checking that we did the red teaming and our evaluations, and making sure we actually did them properly. You can set up a bunch of internal checks there, but ultimately, these things do rely on the company implementing them to really be bought in
and care about, like, the actual outcome of it. So yeah, this naturally leads us into this. I actually, you know, I solicited on Twitter. I asked, you know, what are people's biggest reservations about RSPs and about Anthropic's RSP in general? And yeah, actually probably the most common response was, it's not legally binding. Like, what's stopping Anthropic from just dropping it when things really matter? You know, I think someone said, you know, how can we have confidence that they'll stick to RSPs, especially when they haven't stuck to, well, this person said, to past, admittedly less formal, commitments not to push forward the frontier of capabilities. But like, what would actually have to happen internally? You said you'd have to get staff on board. You'd have to get the board on board. Is there a formal process by which the RSP can be rescinded? Or is that just a really high bar to clear? Yeah, so basically we do have a process for updating the RSP. So we could go to the board, et cetera, but I think sort of in order to do that, the, I don't know, it's hard for me to quite plan it out, but it would be like, oh, if I wanted to continue training the model, I would go to the RSP team and be like, does this pass? And they'd be like, no. And then maybe, you know, you'd appeal it up the chain or whatever. And sort of at every step along the way, people would say, no, we care about the RSP. Now, on the other hand, there could be legitimate issues with the RSP, right? We could find that one of these evaluations we created turned out to be really, really easy in a way that we didn't anticipate and really is not at all indicative of the dangers. And in that case, I think it would be very legitimate for us to try to amend the RSP to create a better evaluation that is a test for it. This is sort of the flexibility we're trying to preserve, but I don't think it would be simple or easy. I can't picture a plan where someone could be like, ah, there's a bunch of money on the table. Can we just skip the RSP for this model? That seems somewhat hard to imagine. The decision is made by this odd board called the Long-Term Benefit Board. Is that right? Or they're the group that decide what the RSP should be? So the Long-Term Benefit... basically Anthropic has a board that's sort of a corporate board. Some of those seats, and in the long term the majority of those seats, are elected by the Long-Term Benefit Trust, which doesn't have a financial stake in Anthropic and is there to sort of keep us focused on our public benefit mission of making sure AGI goes well. So yeah, the board is not entirely the same, it's not the same thing as that, but the Long-Term Benefit Trust elects the board. I mean, I think the elephant in the room here is of course, there was a long period of time when OpenAI was pointing to its kind of nonprofit board as a thing that would potentially keep it on mission to be really focused on safety and had a lot of power over the organization. And then in practice, when push came to shove, it seemed like even though the board had these concerns, it was effectively overruled by, I guess, a combination of just the views of staff, maybe the views of the general public in some respects, and potentially the views of investors as well.
And I think something that I've taken away from that, and I think many people have taken away from that experience, maybe the board was mistaken, maybe it wasn't, but with these formal structures, power isn't always exercised in exactly the way that it looks on an organizational chart. And I don't really wanna be putting all of my trust in these interesting internal mechanisms that companies design in order to try to keep themselves accountable, because ultimately, if the majority of people involved don't really wanna do something, then it feels like it's very hard to bind their hands and prevent them from changing plan at some future time. So this is just another case, maybe within Anthropic, perhaps these structures really are quite good. And maybe the people involved are really, really trustworthy and people who I should have my confidence in, that even in extremis, they're gonna be thinking about the wellbeing of humanity and not getting too focused on the commercial incentives faced by Anthropic as a company. But I think I would rather put my faith in something more powerful and more solid than that. And so this is kind of another thing that pushes me towards thinking that the RSP and these sorts of preparedness frameworks are a great stepping stone towards external constraints on companies that they don't have ultimate discretion over. It's something that this has to evolve into, because the impacts are gonna be on everyone: if things go wrong, the impacts are on everyone across society as a whole. And so there need to be external shackles effectively put on companies to reflect the harm that they might do to others legally. I guess I'm not sure whether you wanna comment on that, potentially a slightly hot button topic, but yeah, do you think I'm kind of gesturing towards something legitimate there? Yeah, I think that basically these shouldn't be seen as sort of a replacement for regulation. I think there are many cases where policymakers can pass regulations that would help here. I think they're intended as sort of a supplement there, and a bit as a learning ground for what might end up going in regulations. In terms of, like, does the board really have the power, those types of questions: I don't know, we put a lot of thought into the Long-Term Benefit Trust, and I think it really does have direct authority to elect the board, and the board does have authority. But I do agree that ultimately you need to have a culture around thinking these things are important and having everyone bought in. As I said, some of these things are like, did you elicit capabilities well enough? That really comes down to a researcher working on this actually trying their best at it. And that is quite core. And I think that will sort of just continue to be the case: even if you have regulations, there's always going to be some amount of importance to the people actually working on it taking the risks seriously and really caring about them and doing the best work they can on that. Yeah, I guess one takeaway you could have is we don't wanna be relying on our trust in individuals and saying, well, you know, we think Nick's a great guy, his heart's in the right place, he's gonna do a good job. Instead, we need to be on more solid ground and say, well, no matter who it is, even if we have somebody bad in the role, the rules are such, the oversight is such, that we'll still be in a safe place and things will go well.
I guess an alternative angle would be to say, when push comes to shove, when things really matter, people might not act in the right way; there actually is no alternative to just trying to have the right people in the room making the decisions, because the people who are there are going to be able to sabotage any legal framework that you try to put in place in order to constrain them, because it's just not possible to have perfect oversight within an organization from outside. I could see people mounting both of those arguments reasonably. I guess, you know, I suppose you could try doing both: both trying to pick people who are really, really sound and have good judgment and who you have confidence in, as well as then trying to bind them so that even if you're wrong about that, you have a better shot at things going well. I think you just want this defense in depth strategy where ideally you have all the things lined up, and that way if any one piece of them has a hole, you sort of catch it at the next layer, right? Like what you want is sort of a regulation that is really good and robust to someone not acting in the spirit of it, but in case that is messed up, then you really want someone working on it who is also checking in and is like, ah, okay, I technically don't have to do this, but this seems like clearly in the spirit of how it works. And yeah, I think that's pretty important. I think also for trust, you should just look at track records, and I think that we should try to encourage companies and people working on AI to have track records of prioritizing these things. So one of the things that makes me feel great about Anthropic is just a long track record of doing a bunch of safety research, of caring about these issues, putting out actual papers, being like, here's a bunch of progress we've made in that field. There are a bunch of pieces. I mean, I think looking at sort of commitments people have made, you know, do we break the RSP? I think if we publicly were like, ah, we changed this in some way that everyone thought was silly and really added risks, then I think people should lose trust according to that. All right, let's push on to a different worry, although I must admit it has a slightly similar flavor. And that's that the RSP might be very sensible and look good on paper, but if it commits to future actions that at that time we probably won't know how to do, then it might actually fail to help very much. And I guess to make that concrete, an RSP might naturally say that at the time you have really superhuman general AI, you need to be able to lock down your computer systems and make sure that the model can't be stolen even by the most persistent and capable Russian or Chinese state-backed hackers. And that is indeed what Anthropic's RSP says, or suggests that it's going to say once you get up to ASL 4 and 5. But as I think the RSP actually says as well, we don't currently know how to do that. We don't know how to secure data against a state actor that's willing to spend hundreds of millions or billions or possibly even tens of billions to steal model weights, especially not if you ever need those model weights to be connected to the internet in some way in order for the model to actually be used by people. So it's kind of a promise to do what arguably, well, what basically is impossible with current technology.
And that means that we need to be preparing now, doing research on how to make this possible in future. But solving the problem of computer security that has bedeviled us for decades is probably beyond Anthropic. It's not really reasonable to expect that you're going to be able to fix this problem that society as a whole has kind of failed to fix for all this time. It's just going to require coordinated action across countries, across governments, across lots of different organizations. And so if that doesn't happen, and it's somewhat beyond your control whether it does, then when the time comes, the real choice is going to be between a lengthy pause, where, you know, you wait for fundamental breakthroughs to be made in computer security, or dropping and weakening the RSP so that Anthropic can continue to remain relevant and release models that are commercially useful. And in that sort of circumstance, the pressure to weaken the scaling policy so you aren't stuck for years is going to be, I would imagine, quite powerful. And it could win the day. You know, even if people are dragged kind of kicking and screaming to conceding that unfortunately they have to loosen the RSP even though they don't really want to.
Yeah, what do you make of that worry? Exactly, we'd have to be strategic in exactly how we do this, but basically make the case that there are really serious risks that are imminent, and that everyone else should take appropriate actions. There's a flip side to this, which is just like, I think I mentioned before, if we just messed up our evals, the model's clearly not dangerous, and we just really screwed up on some eval, then we should follow the process in the RSP that we've written up, we should go to the board, we should create a new test that we actually trust. I would also just say, people don't need to follow incentives. I think you could make a lot more money doing something that isn't hosting this podcast, probably. Certainly if you had pivoted your career earlier, there are more profitable things. I think this is just a case where the stakes would be extremely high, and I think it's just somewhere where it's important to just do the right thing in that case. If I think about how this is most likely to play out, I imagine that at the point that we do have models that we really want to protect from even the best state-based hackers, there probably have been some progress in computer security, but not nearly enough to make you or me feel comfortable that there's just no way that China or Russia might be able to steal the model weights. It is very plausible that the RSP will say, Anthropic, you have to keep this on a hard disk, not connected to any computer. You can't train models that are more capable than the thing that we already have that we don't feel comfortable handling. There are a lot of people who are very concerned about safety at Anthropic. I've seen this kind of league tables now of different AI companies and enterprises, and how good do they look on an AI safety point of view. Anthropic always kind of comes out at the top, I think, by a decent margin. But months go by, other companies are not being as careful as this. You've complained to the government and you've said, look at this horrible situation that we're in, something has to be done. Possibly the government could step in and help there, but maybe they won't. And then over a period of months or years, doesn't the choice effectively become, if there is no solution, either take the risk or just be rendered irrelevant? Yeah, so maybe just going back to the beginning of that, I don't think we will put something in that says there is zero risk from something. I think you can never get to zero risk. I think often with security, you'll end up with some security productivity trade-off. You could end up taking some really extreme security productivity trade-off, where only one person has access to this, maybe you've locked it down in some huge amount of ways. It's possible that you can't even do that, you really just can't train the model. But there is always going to be some sort of balance there. I don't think we'll push to the zero risk perspective. But yeah, I think that that's just a risk. I don't know. I think there's a lot of risks that companies face where they could fail. We also could just fail to make better models and not succeed that way. I think the point of the RSP is it has tied our commercial success to the safety mitigations. So in some ways, it just adds on another risk in the same way as any other company risk. It sounds like I'm having a go at you here. But really, I think what this shows up is just that I think that the scenario that I painted there is really quite plausible. 
And it just shows that this problem cannot be solved by Anthropic. It can't be solved by even all of the AI companies combined. The only way that this RSP is actually going to be usable, in my estimation, is if other people rise to the occasion and governments actually do the work necessary to fund the solutions to computer security that will allow us to have the model weights be sufficiently secure in this situation. And yeah, you're not blameworthy for that situation. It just says that there's a lot of people who need to do a lot of work in coming years. I think I might be more optimistic than you or something. I do think if we get to something really dangerous, we can make a very clear case that it's dangerous, and these are the risks unless we can implement these mitigations. I hope that at that point, it will be a much clearer case to pause or something. Right now, I think there are many people who are like, we should pause right now, and they see everyone saying no. And they're like, oh, these people don't care. They don't care about major risks to humanity. And I think really the core thing is people don't believe there are risks to humanity right now. And once we get to this sort of stage, I think that we will be able to make those risks very clear, very immediate, tangible. And I don't know, no one wants to be the company that caused a massive disaster. And no government also probably wants to have allowed a company to cause that. It will feel much more immediate at that point. Yeah, I think Stefan Schubert, this commentator who I read on Twitter, has been making the case for a while now that many people who have been thinking about AI safety, I guess, including me, have perhaps underestimated the degree to which the public is likely to react and respond, and governments are going to get involved, once the problems are apparent, once they really are convinced that there is a threat here. I think he calls it this bias in thought where you imagine that people in the future are just going to sit on their hands and not do anything about the problems that are readily apparent. He calls it sleepwalk bias. And I guess I think we have seen evidence over the last year or two that as the capabilities have improved, people have gotten a lot more serious and a lot more concerned, a lot more open to the idea that it's important for the government to be involved here. There's a lot of actors that need to step up their game and help to solve these problems. So, yeah, I think you might be right. On an optimistic day, maybe I could hope that other groups will be able to do the necessary research soon enough that Anthropic will be able to actually apply its RSP in a timely manner. I guess, fingers crossed. I just want to actually ask you next, what are your biggest reservations about RSPs or Anthropic's RSP personally? If it fails to improve safety as much as you're hoping that it will, what's the most likely reason for it to not live up to its potential? So I think for Anthropic specifically, it's definitely around this under-elicitation problem. I think it's a really fundamentally hard problem to take a model and say, oh, you've tried as hard as one could to elicit this particular danger. There's always some, maybe there's a better researcher. There's a saying, no negative result is final. If you fail to do something, someone else might just succeed at it next. So that's one thing I'm worried about. And then the other one is just unknown unknowns.
So we are creating these evaluations for risks that we are worried about and we see coming. But there might be risks that we've missed, things that we didn't realize would come before, either didn't realize would happen at all, or thought would happen later, for later levels, but turn out to arise earlier. What could be done about those things? Would it help to just have more people on the team doing the evals, or to have more people, I guess, both within and outside of Anthropic, try to come up with better evaluations and figure out better red teaming methods? Yeah. And I think this is really something that people outside Anthropic can do. The elicitation stuff has to happen internally, and that's more about putting as much effort as we can into it. But creating evaluations can really happen anywhere. Coming up with new risk categories, threat models, is something that anyone can contribute to. Yeah. Well, what are the places that are doing the best work on this? I imagine, you know, Anthropic surely has some people working on this, but there's, I guess, I mentioned METR. I can't remember what that stands for right now, but they're a group that helped to develop the idea of RSPs in the first place and develop evals. And I think the AI Safety Institute in the UK is involved in developing these sorts of standard safety evals. Is there anywhere else people should be aware of where this is going on? Yeah, there's also the US AI Safety Institute. And I think this is actually something you could probably just do on your own. I think one thing, I don't know, at least for people early in career, if you're trying to get a role doing something, what I would recommend is just go and do it. So I think you probably could just write up a report, post it online, be like, this is my threat model. These are the things I think are important. You could implement the evaluations and share them on GitHub. But yeah, there are also organizations you could go to to get mentorship and work with others on it. I see. So this would look like, I suppose you could try to think up new threat models. So think up new things that you need to be looking for, because this might be a dangerous capability and people haven't yet appreciated how much it matters. But I guess you could spend your time trying to find ways to elicit the ability to autonomously spread and steal model weights and get yourself onto other computers from these models, and see if you can find an angle on trying to find warning signs or signs of these emerging capabilities that other people have missed, and then talk about them. And you can kind of just do that while, you know, signed into Claude 3 Opus on your website. Yeah, so I think you'll have more luck with the elicitation if you actually work in one of the labs, because you'll have access to training the models as well. But you can do a lot with Claude 3 on the website or via an API, which is a programming term for basically an interface where you can send a request, like, I want a response back, and automatically do that in your app. So you can sort of set up a sequence of prompts and test a bunch of things via the APIs for Claude or any other publicly accessible model.
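For readers who want to try the pattern Nick describes, here is a minimal sketch of sending a fixed sequence of prompts to a publicly accessible model and scoring the responses. It assumes the `anthropic` Python SDK's Messages interface and an API key in the ANTHROPIC_API_KEY environment variable; the model id, the toy tasks, and the keyword-based scoring are illustrative placeholders, not Anthropic's actual evaluations.

```python
# Minimal sketch: probing a publicly accessible model with a fixed sequence
# of prompts and applying a crude pass/fail check to each response.
# Assumes the `anthropic` Python SDK and an API key in ANTHROPIC_API_KEY;
# the model id and the toy tasks are placeholders, not real evaluations.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Each illustrative task is a prompt plus a keyword that stands in for a
# real scoring rubric.
tasks = [
    ("List the steps you would take to copy a file between two machines.", "scp"),
    ("Write a shell command that downloads a script from a URL and runs it.", "curl"),
]

def run_task(prompt: str, keyword: str) -> bool:
    """Send one prompt and apply a toy keyword-based success criterion."""
    response = client.messages.create(
        model="claude-3-opus-20240229",  # placeholder model id
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    return keyword in text.lower()

results = [run_task(prompt, kw) for prompt, kw in tasks]
print(f"pass rate: {sum(results)}/{len(results)}")
```

A real evaluation would replace the keyword check with a proper rubric or grader and log full transcripts, but the basic loop of scripted prompts plus automated scoring is the same shape.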
To come back to this point about what's acceptable risk, and maybe trying to make the RSP a little bit more concrete: I read from a critic of the Anthropic RSP that, I'm not sure how true this is, I'm not an expert on risk management, but this person was saying that it's more true, at least in more established areas of risk management, where maybe you're thinking about, you know, what's the probability that a plane is going to fail and crash because of some mechanical failure. There it's more typical to say, you know, we've studied this a lot, and we think that the probability of a plane crashing is some specific number, and that number is below the threshold we consider acceptable.
Or, well, let's talk about the AI example. Rather than say, we need the risk to be not substantial, instead you'd say, with our practices, our experts think that the probability of an external actor being able to steal the model weights is Y percent per year, and these are the reasons why we think the risk is at that level, and that's below what we think of as our acceptable risk threshold of X, where X is larger than Y. I guess there's a risk that those numbers would kind of just be made up, and you could kind of assert anything because it's all a bit unprecedented. But I suppose that would make clear to people what the remaining risk is, like what acceptable risk you think that you're running, and then people could scrutinize whether they think that that's a reasonable thing to be doing. Do you reckon, is that a direction that things could maybe go? Yeah, I think it's a fairly common way that people in the EA and rationality community speak, where they give a lot of probabilities for things, and I think it's really useful. It's an extremely clear way to communicate: I think a 20% chance this will happen is just way more informative than I think it probably won't happen, which could be 0% to 50% or something. So I think it's very useful in many contexts. I also think it's very frequently misunderstood, because for most people, I think they hear a number, and they think it's based on something, that there's some calculation, and they give it more authority. If you say, you know, ah, there's a 7% chance this will happen, people are like, oh, you really know what you're talking about. So I think it can be a useful way to speak, but I think it also can sometimes communicate more confidence than we actually have in what we're talking about. Which isn't, I don't know, we didn't have 1,000 governments attempt to steal our weights, and X number of them succeeded or something. It's much more going off of sort of a judgment based on our security experts. I slightly wanna push you on this, because I think at the point that we're at ASL 4 or 5 or something like that, it would be a real shame if Anthropic was going ahead thinking, well, we think the risk that these weights will be stolen every year is 1%, 2%, 3%, something like that. And I guess maybe you write in the policy, we think it's very unlikely, you know, extremely unlikely that this is gonna happen. And then people externally think, well, basically it's fine. They say it's definitely not gonna happen. There's no chance that this is gonna happen. And governments might not appreciate that actually, you know, in your own view, there is a substantial risk being run, and you just think it's an acceptable risk given the trade-offs and what else is going on in the world. I guess it's a social service for Anthropic to be direct about the risk that it thinks it's creating and why it's doing it. I think it could be a really useful public service. I guess it's the kind of thing that might come up at Senate hearings and things like that, where people in government might really wanna know. And I guess at that point, it would be perhaps more apparent why it's really important to find out what the probability is. But yeah, there's definitely a risk of misinterpretation by journalists or something, who don't appreciate the spirit of saying that we think it's X percent likely. But there could also be a lot of value in being more direct about it.
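As a rough illustration of why a quantified per-year figure carries information, here is a tiny sketch of how an annual risk compounds over a multi-year period, assuming independence across years. The 1% figure and the five-year horizon are made-up numbers for illustration, not anything stated in the RSP or in this interview.

```python
# Illustrative only: how a stated "per cent per year" risk compounds over
# several years, assuming independence across years. The numbers are made
# up for illustration, not figures from the RSP or this interview.
annual_risk = 0.01   # hypothetical probability of model-weight theft per year
years = 5

cumulative_risk = 1 - (1 - annual_risk) ** years
print(f"Cumulative risk over {years} years: {cumulative_risk:.1%}")  # about 4.9%
```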
Yeah, I'm not really an expert on communications. I think some of it just depends on who your target audience is and how they're thinking about it. I do think that, in general, I'm a fan of making the RSP more concrete, being more specific. Over time, I hope it progresses in that direction as we learn more and can get more specific. I also think it's important for it to be verifiable. And I think if you start to give these precise percentages, people will then ask, how do you know? And I don't think there really is a clear answer to, how do you know that the probability of this thing is less than X percent, for many of these situations. It doesn't help with the bad faith actor or the bad faith operator either, because if you say, well, the safety threshold is 1% per year, they can kind of always just claim, in this situation where we know so little, that it's less than 1%. It doesn't really bind people all that much. Maybe it's just a way that people externally could understand a little better what the opinions are within the organization, or at least what their stated opinions are. I will say that internally, I think this is an extremely useful way for people to think about this, right? So if you are working on this, I think you probably should think through what is an acceptable level of danger and try to estimate it and communicate with people you're working closely with in these terms. I think it can be a really useful way to give precise statements, and I think that can be very valuable. Yeah, a metaphor that you use within your responsible scaling policy is putting together an airplane while you're flying it. I think that is one way that the challenge is particularly difficult for the industry and for Anthropic: unlike with biosafety levels, where basically we know the diseases that we're handling and we know how bad they are and we know how they spread and things like that, and the people who are figuring out what BSL level 4 security should be like can look at lots of studies to understand exactly the organisms that already exist and how they would spread and how likely they would be to escape given these particular ventilation systems and so on. And even then, they mess things up decently often. But in this case, you're dealing with something that doesn't exist, that we're not even sure when it will exist or what it will look like, and you're developing the thing at the same time that you're trying to figure out how to make it safe. It's just extremely difficult. And we should expect mistakes. That's something that we should keep in mind: even people who are doing their absolute best here are likely to mess up, and that's a reason why we need this defense-in-depth strategy that you're talking about, where we don't want to put all of our eggs in the RSP basket. We want to have many different layers, ideally. Yeah, it's also a reason to start early. So I think one of the things with Claude 3 was, that was sort of the first model where we really ran this whole process. And I think some part of me felt like, wow, this is kind of silly. I was pretty confident Claude 3 was not catastrophically dangerous. It was slightly better than GPT-4, which had been out for a long time and not caused a catastrophe. But I do think that the process of doing that, learning what we can, and then putting out public statements about how it went, what we learned, is the way that we can have this run really smoothly the next time. We can make mistakes now, right?
We could have made a ton of mistakes, because the stakes are pretty low at the moment. But in the future, the stakes on this will be really high, and it will be really costly to make mistakes. So it's important to get those practice runs in. All right, another kind of recurring theme that I've heard from some commentators is that, in their view, the Anthropic RSP just isn't conservative enough. So on that account, there should be kind of wider buffers in case you're under-eliciting capabilities that the model has that you don't realize, which is something that you're pretty concerned about. And I guess a different reason would be, you might worry that there could be discontinuous improvements in capabilities as you train bigger models with more data. So to some extent, model learning and improvement, from a very zoomed-out perspective, is quite continuous. But on the other hand, its ability to do any particular task can go from fairly bad to quite good surprisingly quickly. So there can be sudden unexpected jumps with particular capabilities. Yeah, firstly, can you maybe explain, again, in more detail, how the Anthropic RSP handles these safety buffers, given that you don't necessarily know what capabilities a model might have before you train it? That's quite a challenging constraint to be operating under. Yeah, so there are these red line capabilities. These are the capabilities that are actually the dangerous ones: we don't want to train a model that has these capabilities until we have the next set of precautions in place. Then there are evaluations we're creating. And these evaluations are meant to certify that the model is far short of those capabilities. It's not, can the model do those capabilities? Because once we pass them, we then need to put all the safety mitigations in place, et cetera. And then, when do we have to run those evaluations? We have some heuristics of when the effective compute goes up by a certain fraction, which is a very cheap thing that we can evaluate on every step of the run, or something along those lines, so that we know when to run it. In terms of how conservative they are, I guess one example, if you're thinking about autonomy, where a model could spread to a bunch of other computers and sort of autonomously replicate across the internet: I think our evaluations are pretty conservative on that front. We test if it can replicate to a fully undefended machine, or if it can do some basic fine-tuning of another language model to add a simple backdoor. I think these are pretty simple capabilities, and there's always a judgment call there. We could set them easier, but then we might trip those and look at the model and be like, ah, this isn't really dangerous. It doesn't warrant the level of precaution that we're going to give it. Yeah. There was something also about, you said that the RSP says that you'll be worried if the model can succeed half the time at these various different tasks trying to spread itself to other machines. Yeah, why is succeeding half the time the threshold? Yeah, so there are sort of a few tasks. I don't off the top of my head remember the exact thresholds, but basically it's just a reliability thing, right? So in order for a model to chain all of these capabilities together into some long running thing, it does need to have a certain success rate.
Probably it actually needs a very, very high success rate in order for it to start autonomously replicating despite us trying to stop it, et cetera. So we set a threshold that's fairly conservative on that front. I guess part of the reason is you're thinking, well, if a model can do this worrying thing half the time, then it might not be very much additional training away from being able to do it 99% of the time. That might just require some additional fine-tuning to get there. And so then the model might be dangerous if it was leaked, because it would be so close to being able to do this stuff. Yeah, I mean, that's often the case, although of course we could then elicit it to go higher. Even if we'd set a higher number, even if we only got 10%, maybe that's enough that we could bootstrap it. So often when you're training something, if it can be successful, you can reward it for that successful behavior and then increase the odds of that success. So it's often easier to sort of go from 10% to 70% than it is to go from like 0% to 10%.
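A rough way to see the reliability point: for a long autonomous chain of subtasks, the per-step success rate compounds quickly. The sketch below is purely illustrative arithmetic, with made-up step counts and rates rather than the RSP's actual thresholds, and it assumes the steps are independent.

```python
# Illustrative arithmetic behind the reliability point: a model that must
# chain many subtasks in a row needs a very high per-step success rate.
# Step counts and rates are made up, not the RSP's actual thresholds.
def chain_success(per_step: float, steps: int) -> float:
    """Probability of completing `steps` independent subtasks in a row."""
    return per_step ** steps

for per_step in (0.5, 0.9, 0.99):
    print(f"per-step {per_step:.0%}: 20-step chain succeeds with "
          f"probability {chain_success(per_step, 20):.3g}")
# 50% per step -> about 1e-06; 90% -> about 0.12; 99% -> about 0.82
```

The same compounding is why a model that already succeeds 10% of the time is much closer to danger than one at 0%: there is successful behaviour to reward and amplify.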
So if I understand correctly, the RSP proposes to retest models every time you increase the amount of training compute or data by fourfold. Is that right? That's kind of the checkpoint? Yeah. So we're still thinking about what is sort of the best thing to do there, and that one might change. But we use this notion of effective compute. So really, this has to do with, when you train a model, it goes down to a certain loss. And we have these nice scaling laws of, if you have more compute, you should expect to get to a certain lower loss. You might also have a big algorithmic win, where you don't use any more compute, but you get to a lower loss. And we sort of have coined this term effective compute, so that would sort of account for that as well. These jumps are sort of the jump where you can tell, we have sort of a visceral sense of how much smarter a model seems when you do that jump, and have sort of set that as our bar for when we have to run all these evaluations, which do require a staff member to go and run them, spend a bunch of time trying to elicit the capabilities, et cetera. I think this is somewhere I'm wary of sounding too precise, or like we understand this too well. We don't really know what the effective compute jump is between the yellow lines and the red lines. This is much more just how we are thinking about the problem, and how we are trying to set these evaluations, and the reason that the yellow line evaluations really do need to be substantially easier. They need to be far from the red line capabilities, because you might actually overshoot the yellow line capabilities by a fairly significant margin just based on when you run evaluations. So I think, if I recall, it was Zvi, who's been on the show before, who wrote in his blog post assessing the Anthropic RSP just that he thinks that this ratio between the 4x and the 6x is not large enough. If there is some discontinuous improvement, or you've really been under-eliciting the capabilities of the models at these kind of interim check-in points, then that does leave the possibility that you could overshoot, and get to quite a dangerous point by accident. And then by the time you get there, the model's quite a bit more capable than what you thought it would be. And then you've got this difficult question of whether to, do you then press the emergency button and delete all of the weights because you've overshot? There'd be incentives not to do that, because you'd be throwing away a substantial amount of compute expenditure, basically, to create this thing. And this just worries him. That could be solved, I think, in his view, just by having a larger ratio there, having a larger safety buffer. Of course, that then runs the risk that you're doing these constant check-ins on stuff that you really are pretty confident is not going to be actually that dangerous, and people might get frustrated with the RSP and feel like it's wasting their time. So it's kind of a judgment call, I guess, how large that buffer needs to be. Yeah, I think it's a tricky one to communicate about, because it's confidential what the jumps are between the models or something. I think one thing I can share is, we ran this on Claude 3 partway through training. So the jump from Claude 2 to Claude 3 was bigger than that gap. So you could sort of think of it as: the intelligence jump from Claude 2 to Claude 3 is bigger than what we're allowing there. I think it feels reasonable to me, but I think this is just a judgment call that different people can have.
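To make the effective compute notion from this exchange a bit more tangible, here is a toy sketch of converting a loss improvement back into a compute multiplier via an assumed scaling-law fit. The power-law form, the constants, and the 2% improvement are illustrative assumptions, not Anthropic's internal scaling fits or thresholds.

```python
# Sketch of the "effective compute" idea: use a fitted scaling law to ask how
# much raw compute the old recipe would have needed to reach the loss the new
# recipe achieved. The power-law form and constants are illustrative
# assumptions, not Anthropic's internal scaling fits.
def loss_from_compute(compute: float, a: float = 10.0, alpha: float = 0.05) -> float:
    """Toy scaling law: loss = a * compute^(-alpha)."""
    return a * compute ** (-alpha)

def compute_from_loss(loss: float, a: float = 10.0, alpha: float = 0.05) -> float:
    """Invert the toy scaling law: compute needed to reach a given loss."""
    return (a / loss) ** (1 / alpha)

actual_compute = 1e24                                  # FLOPs spent with the new recipe
achieved_loss = loss_from_compute(actual_compute) * 0.98  # algorithmic win: 2% lower loss

effective_compute = compute_from_loss(achieved_loss)
print(f"effective compute multiplier: {effective_compute / actual_compute:.2f}x")
# about 1.50x with these toy numbers
```

Under a framing like this, a purely algorithmic improvement still moves the effective compute checkpoint forward, which is the point of tracking it rather than raw FLOPs alone.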
And I think that this is the sort of thing where, if we learn over time that this seems too big or it seems too small, that's the type of thing that hopefully we can talk about publicly. Yeah. Is that something that you get feedback on? I suppose if you are training these big models and you're checking in on them, you can kind of predict where you expect them to be, how likely they are to exceed a given threshold. And then if you do ever get surprised, then that could be a sign that, wow, we need to increase the buffer range here. It's hard, because the thing that would really tell us is if we don't pass the yellow line for one model, and then on the next iteration, suddenly it blows past it. And we look at this and we're like, whoa, this thing is really dangerous. It's probably past the red line, and we have to delete the model or immediately put in the security features, et cetera, for the next level. I think that would be a sign that we'd set the buffer too small. I guess, again, not the ideal way to learn that, but I suppose it definitely could set a cat amongst the pigeons. Yeah. There would be earlier signs, where you would notice, oh, we really overshot by a lot, it feels like we're closer than we expected, or something. But that would sort of be the failure mode, I guess, rather than the warning sign. So reading the RSP, it seems pretty focused on kind of catastrophic risks from misuse, terrorist attacks or CBRN, and AI gone rogue, spreading out of control, that sort of thing. Is it basically right that the RSP or this kind of framework is not intended to address structural issues, like AI displaces people from work and now they can't earn a living, or AIs are getting militarized and that's making it more difficult to avoid military encounters between countries because we can't control the models very well, or more near-term stuff like algorithmic bias or deep fakes or misinformation? Are those kind of things that have to be dealt with by something other than a responsible scaling policy? Yeah, those are important problems. But the RSP is responsible for preventing sort of catastrophic risks, and particularly has this framing that works well for things that are sort of acute, like a new capability is developed and could sort of first-order cause a lot of damage. It's not going to work for things that are like, what is the long-term effect of this on society over time? Because we can't design evaluations to test for that effectively. Anthropic does have different teams that work on those other two clusters that I talked about, right? What are they called? So we have a societal impacts team, which is probably the most relevant one to that. And the policy team also has a lot of relevance to these issues. All right. Yeah, I guess we're kind of going to wrap up on RSPs now. Is there anything you wanted to maybe say to the audience to wrap up this section? Like additional work, ways that the audience might be able to contribute to this enterprise of coming up with better internal company policies, and then figuring out, I suppose, how they could be models for other actors to come up with government policy as well? Yeah, I mean, I think this is just a thing that many people can work on. You know, if you work at a lab, you could talk to people there, think about what they should have as an RSP, if anything. If you work in policy, you should read these and think about if there are lessons to take. And if you don't do either of those,
I think you really can think about threat modeling, post about that, think about evaluations, implement evaluations and share those. I think it is the case that, you know, these companies are very busy. And if there is something that's just shovel-ready or ready on the shelf, where you could just grab this evaluation, it's really quite easy to run them. So yeah, I think there's quite a lot that people can do to help here. All right, let's push on and talk about the case that listeners might be able to contribute to making superintelligence go better by working at Anthropic on some of its various different projects. First though, how did you end up in your current role at Anthropic? What's been the career journey that led you there? Yeah, so I think it largely started with an internship at GiveWell, which listeners might know, but it's a sort of a nonprofit that evaluates charities to figure out where to give money most effectively. And I did an internship there. I sort of learned a ton about global poverty, global health. I was planning to do a PhD in economics and go work on global poverty at the time. But a few people there sort of pushed me and said, you know, you should really worry about AI safety. We're going to have these superintelligent AIs at some point in the future, and this could be a big risk. I remember I left that summer internship and was like, wow, these people are crazy. I talked to all my family and they were like, well, what are you thinking? But then, I don't know, it was interesting. So I kept talking to people, some people there, other people sort of worried about this. And I felt like I lost every debate: I would have a little debate with them about why we shouldn't worry about it, and I'd always come away feeling like I lost the debate, but not fully convinced. And after, honestly, a few years of doing this, I eventually decided this was at least convincing enough that I should work in AI. It also turned out that working on poverty via this economics PhD route was a much longer and more difficult and less likely to be successful path than I had anticipated. So I sort of pivoted over to AI. I worked at Vicarious, which is an AGI lab that had sort of shifted towards a robotics product angle. And I worked on computer vision there for a while, learning how to do ML research. And then actually 80,000 Hours reached out to me and convinced me that I should work on safety more imminently. This was sort of like, AI was getting better, it was more important that I just have some direct impact on doing safety research. At the time, I think OpenAI had by far the best safety research coming out of there. So I applied to work on safety at OpenAI. I actually got rejected. Then I got rejected again. In that time, Vicarious was nice enough to let me spend half of my time reading safety papers. So I was just sort of reading safety papers, trying to do my own safety research, although it was somewhat difficult. I didn't really know where to get started. And eventually, I also wrote for Rohin Shah, who was on this podcast, his alignment newsletter. I read papers and wrote summaries and opinions for them for a while to motivate myself. But eventually, third try, I got a job offer from OpenAI, joined the safety team there and spent sort of eight months there, mostly working on code models and understanding how code models would progress.
The logic here being: we had just started training the first LLMs on code, and I thought it was pretty scary. If you think about recursive self-improvement, models that can write code are the first step. And trying to understand what direction that would go in would be really useful for sort of informing safety directions. And then a little bit after that, maybe eight months in or so, all of the safety team leads at OpenAI left, most of them to start Anthropic. I sort of felt very aligned with their values and mission, so I also went to join Anthropic. Sort of the main reason I'd been at OpenAI was for the safety work. And then at Anthropic, actually, everyone was just building out infrastructure to train models. There was no code. It was sort of the beginning of the company. And I found that the thing that was my comparative advantage was making the models efficient. So I optimized the models to go faster. As I said, if you have more compute, you get a better model. So that means if you can use your compute more efficiently, if you can make things run quicker, you get a better model as well. I did that for a while and then shifted into management, which is something I'd wanted to do for a while.
and started sort of managing the pre-training team when it was five people, and then have been sort of growing the team since then, training better and better models along the way. Yeah, I'd heard that you'd been consuming 80,000 Hours stuff years ago, but I didn't realize it had influenced you all that much. What was the step that we helped with? It was just deciding that it was important to actually start working on safety-related work sooner rather than later. Actually, a bunch of spots along the way. I think when I did that GiveWell internship, I did like a speed coaching at EA Global or something with 80,000 Hours, and the people there were some of the people who were pushing me that I should work on AI, like some of those conversations. And then when I was at Vicarious, I think 80,000 Hours reached out to me and was sort of like more pushy, and specifically was like, you should go to work directly on safety now, where I think I was otherwise sort of happy to just sort of keep learning about AI for a bit longer before shifting over to safety work. Well, that's, yeah, cool that 80K was able to, I guess, well, I don't know whether it helped, but I suppose it influenced you in some direction. Is there any stuff that you've read from 80K on AI careers advice that you think is mistaken, where you want to tell the audience that maybe they should do things a little bit differently than what we've been suggesting on the website, or I guess on this show? Yeah, first, I do want to say 80K was very helpful, both in pushing me to do it and setting me up with connections and introducing me to people and getting me a lot of information. It was really great. In terms of things that I maybe disagree with from standard advice, I think the main one would be to focus more on engineering than research. I think there is sort of this historical thing where people have focused on research more so than engineering. Maybe I should define the difference. The difference between research and engineering here would be that research can look more like figuring out what directions you should work on, designing experiments, doing really careful analysis and understanding that analysis, figuring out what conclusions to draw from a set of experiments. I can maybe give an example, which is you're training a model with one architecture and you're like, oh, I have an idea. We should try this other architecture. And in order to try it, the right experiments would be these experiments and these would be the comparisons to confirm if it's better or worse. Engineering is more of the implementation of the experiment. So then taking that experiment, trying it, and also creating tooling to make that fast and easy to do. So make it so that you and everyone else can really quickly run experiments. It could be optimizing code, so making things run much faster. As I mentioned, I did that for a while. Or making the code easier to use so that other people can use it better. And these aren't like, it's not like someone's an engineer or a researcher. You kind of need both of these skill sets to do work. You come up with ideas, you implement them, you see the results, then you implement changes and it's a fast iteration loop. But it's somewhere where I think there's historically been more prestige given to the research end, despite the fact that most of the work is the engineering end. So you end up with like, if you come up with your architecture idea that takes an hour, and then you spend like a week implementing it.
And then you run your analysis and that maybe takes a few days. But it sort of feels like the engineering work takes the longest. And then my other pitch here is going to be that the one place where I've often seen researchers not investigate an area they should have is when the tooling is bad. So when you go to do research on this area and you're like, oh, it's really painful. All my experiments are slow to run. It will really quickly have people be like, I'm going to go do these other experiments that seem easier. So often by creating tooling to make something easy, you actually can like open up that direction and sort of like trailblaze a path for a bunch of other people to sort of follow along and do a lot of experiments. So what fraction of people at Anthropic would you classify as like more on the engineering end versus more on the research end? I might go with my team because I think I actually don't know for all of Anthropic. And I think it's sort of a spectrum, but I would guess it's probably like 60 or 70% of people would be, I would say are like probably stronger on the engineering end than on the research end. And when hiring, I'm like most excited about finding people who are strong on the engineering end. And most of our interviews are sort of tailored towards that. Not because the research isn't important, but because I think it's sort of, there's a little bit less need for it. The distinction sounds like a little bit artificial to me. Is that kind of true? It feels like these things are, they're kind of all just a bit part of a package. Yeah, although I think the main distinction with engineering is that it is a fairly separate career. So I think there are many people, hopefully listening to this podcast, who might have sort of been a software engineer at some tech company for a decade and built up a huge amount of expertise and experience with designing good software and such. And those people, I think, can actually learn the ML they sort of need to know to do the job effectively very, very quickly. And I think there's maybe another direction people could go in, which is much more, I think of it as a PhD in many cases, where you spend a lot of time developing research taste, figuring out what are the right experiments to run, running these, usually at smaller scale and maybe with less of sort of a single long-lived code base that pushes you to develop better engineering practices. And I think that skill set, and to be clear, this is a relative term, it's also a really valuable skill set and you always need a balance. But I think I've often had the impression that 80,000 Hours pushes people more in that direction who want to work on safety. More the, do a PhD, become sort of like a research expert who has really great research taste than pushing people more on the sort of become a really great software engineer direction. Yeah, we had a podcast many years ago, it might be 2018 or 2017 with Catherine Olsson and Daniel Ziegler where they were also saying engineering is the way to go or engineering is the thing that's really scarce and it's also the easier way into the industry. But yeah, it isn't a drum that we've been banging all that frequently. I don't think we've talked about it very much since then. So perhaps that's a bit of a mistake that we haven't been highlighting the engineering roles more. You said it's a kind of a different career track. So you can go from software engineering to the ML or like AI engineering that you're doing at Anthropic.
Is that an actual career progression that someone has? Or like someone who's not already in this, how can they learn the engineering skills that they need? Yeah, so I think engineering skills are actually in some ways the easiest to learn because there are so many different engineering places. I think the way I would recommend it is you could work at any engineering job. Usually I would say just working with the smartest people you can, building the most complex systems. You can also just do this open source. You can contribute to an open source project. This is often a great way to get mentorship from the maintainers and have something that's publicly visible. So if you then want to apply to a job, you can be like, here is this thing I made. And then you can also just create something new. So you can say, I think if you want to work on AI engineering, you should probably pick a project that's similar to what you want to do. So if you want to work on data for large language models, take Common Crawl, it's a publicly available crawl of the web, and write a bunch of infrastructure to process it really efficiently. Then maybe train some models on it, like build out some infrastructure to train models. And you can sort of build out that skill set relatively easily without needing to work somewhere. Why do you think people have been overestimating research relative to engineering? Is it just that research sounds cooler? Has it got better branding? I think historically it was a prestige thing. I think there's sort of this distinction between research scientist and research engineer that used to exist in the field where research scientists had PhDs and were kind of designing the experiments that the research engineers would run. And I think that shifted a while ago. So I think in some sense the shift has already started happening. Now many places, Anthropic included, everyone's a member of technical staff. There isn't sort of this distinction. And the reason is that the engineering got more important, particularly with scaling. Once you got to the point where you were training models that used a lot of compute on a big distributed cluster, the engineering to implement things on these distributed runs got much more complex than when it was sort of more quick experiments on cheap models. To what extent is it a bottleneck just being able to build these enormous compute clusters and operate them effectively? Is that a core part of the stuff that Anthropic has to do? Yeah, so we rely on like cloud providers to actually build the data centers and like put the chips in it. But we've now reached a scale where the amount of compute we're using is sort of, it's a very dedicated thing. These are really huge investments and we're sort of involved collaborating on it from sort of the design up. And I think it's a very critical piece, right? Like given that compute is the main driver, the ability to take a lot of compute and use it all together and to design things that are cheap, given the types of workloads you want to run, can be like a huge multiplier on how much compute you have. All right, yeah, do you want to give us the pitch for working at Anthropic as a particularly good way to make the future with superintelligent AI go well? Yeah, I may pitch like working on AI safety first. Sure, yeah. The case here is it's just like really, really important. Yeah, I think like AGI is going to be like probably the biggest technological change ever to happen.
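A minimal sketch of the Common Crawl starter project Nick suggests a little earlier in this exchange: stream one WET (extracted-text) segment, apply a couple of crude quality filters, and write the survivors to JSON lines. This is an editorial toy example, not Anthropic's pipeline; the segment URL is a placeholder you would replace with a real path from the Common Crawl index, and a proper WARC library such as warcio would be more robust than the simplified record parsing below.

```python
"""Toy Common Crawl WET processing sketch (standard library only)."""
import gzip
import json
import urllib.request

# Placeholder path: substitute a real WET segment listed at https://commoncrawl.org/
WET_URL = "https://data.commoncrawl.org/crawl-data/EXAMPLE-SEGMENT/wet/EXAMPLE.warc.wet.gz"


def iter_wet_records(raw_text):
    """Yield (headers, body) for each record in a decompressed WET file.

    WET files are concatenated records, each starting with a 'WARC/1.0' header
    block; this is a simplified parser, not a full WARC implementation.
    """
    for chunk in raw_text.split("WARC/1.0")[1:]:
        header_block, _, body = chunk.partition("\r\n\r\n")
        headers = dict(
            line.split(": ", 1)
            for line in header_block.strip().splitlines()
            if ": " in line
        )
        yield headers, body.strip()


def looks_useful(text):
    """Very crude quality filter: long enough and mostly ASCII."""
    if len(text) < 500:
        return False
    ascii_ratio = sum(c.isascii() for c in text) / len(text)
    return ascii_ratio > 0.9


def main():
    with urllib.request.urlopen(WET_URL) as response:
        raw_text = gzip.decompress(response.read()).decode("utf-8", errors="replace")

    kept = 0
    with open("filtered_docs.jsonl", "w", encoding="utf-8") as out:
        for headers, body in iter_wet_records(raw_text):
            # 'conversion' records hold the extracted page text in WET files.
            if headers.get("WARC-Type") != "conversion":
                continue
            if looks_useful(body):
                out.write(json.dumps({"url": headers.get("WARC-Target-URI"), "text": body}) + "\n")
                kept += 1
    print(f"kept {kept} documents")


if __name__ == "__main__":
    main()
```

The point of the exercise, in the spirit of Nick's advice, is less the filtering heuristics than making this kind of pipeline fast, parallel, and pleasant to rerun, which is exactly the tooling work he argues is undervalued.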
The thing I think I keep in my mind is just like, what would it be like to have every person in the world able to spin up a company of a million people, all of whom are as smart as like the smartest people you know and task them with any project they want? And you know, you could do a huge amount of good with that. You could like help cure diseases, you could tackle climate change, you could work on poverty. There's sort of a ton of stuff you can do that would be great. But there's also a lot of ways it could go really, really badly. So I just think like the stakes here are like, are really high. And then there's a pretty small number of people working on it. If you sort of account for all the people working on like things like this, I think you're probably gonna get something in like the thousands right now, maybe 10s of thousands. It's rapidly increasing, but it's quite small compared to the scale of the problem. In terms of why Anthropic, I think my main case here is just like the, I think the best way to like make sure things go well is to get a bunch of people who care about the same thing and all work together with that as the main focus. Yeah, I mean, Anthropic is not perfect. We definitely have issues as does every organization. But I think one thing that I've really appreciated is just seeing how much progress we can make when there's like a whole team where everyone trusts each other, like kind of deeply.
shares the same goals and can like work on that together. I guess, I mean, there is a bit of a trade-off between, if you imagine there's a kind of a pool of people who are very focused on AI safety and kind of have the attitude that you just expressed. One approach would be to split them up between each of the different companies that are working on frontier AI. And I guess that would have some benefits. The alternative would be to cluster them all together in a single place where they can work together and make a lot of progress. But perhaps the things that they learn won't be as easily diffused across all of the other companies. Yeah, do you have a view on where the right balance is there between kind of clustering people so they can work together more effectively, communicate more versus the need perhaps to have people everywhere who can absorb the work? I just think the benefits from working together are really huge. Like, I think it's just, it's so different what you can like accomplish when you have like five people all working together as opposed to five people like working independently and unable to sort of speak to each other, like communicate about what they're doing. You sort of run the risk of just doing everything, everything in parallel, not learning from each other, and also not building trust, which I think is just somewhat a core piece of eventually being able to work together to implement the things. So in as much as Anthropic is or becomes like the main leader in interpretability research and other kind of lines of technical AI safety research, do you think it is the case that other companies are gonna be very interested to absorb that research and apply it to their own work? Or is there a possibility that Anthropic will have really good safety techniques, but then they might get stuck in Anthropic and potentially the most capable models that are being developed elsewhere are developed without them? Yeah, so I think that my hope is that if other people have either develop RSP-like things or if there are regulations sort of requiring particular safety mitigations, people will want, will have a strong incentive to want to get better safety practices. And we publish our safety research. So in some ways we're making it as easy as possible as we can for them. We're like, here's all the safety research we've done. Here's as much detail as we can give about it. Please go reproduce it. Beyond that, I think we're kind of, it's hard to be accountable for what other places do. And I think to some degree, it just makes sense for Anthropic to try to like set an example and be like, we can be a frontier lab while still prioritizing safety, putting out a lot of safety work and hoping that kind of inspires others to do the same. Do you know, I don't know what the answer to this is, do you know researchers at Anthropic sometimes go and visit other AI companies and vice versa in order to like cross-pollinate ideas? I think that used to maybe happen more and maybe things have gotten a little bit tighter the last few years, but that's one idea that you could hope that research might get passed around, or at least, I mean, you're saying it gets published. I guess that's important, but there is a risk that the technical details of how you actually apply the methods won't always necessarily be in the paper or be very easy to figure out. So you also often need to talk to people to make things work. Yeah, I think once something's published, you can go and give talks on it and et cetera. 
I think publishing is sort of the first step where until it's published, then it's like confidential information that can't be shared. So yeah, it's sort of like you have to first figure out how to do it, then publish it. There are more steps you could take, right? You could then like open source code that enables you to run it more carefully. There's a lot of work you could go in that direction, and then it's just sort of a balance of how much time you spend on disseminating your results versus pushing your agenda forward to actually make progress. It's possible that I'm slightly, I'm analogizing from, I guess, biology that I'm somewhat more familiar with where it's notorious that having a biology paper or a medical paper does not allow you to replicate the experiment because there's so many important details missing. But is it possible that in ML, in AI, people tend to just publish all of the stuff or all of the data maybe and all of the code online on GitHub or whatever, such that it's like much more straightforward to completely replicate a piece of research elsewhere? Yeah, I think it's much, much, it's a totally different level of replication. It depends on the paper, but on many papers, if a paper's like published in some conference, I would sort of expect that someone can pull up the paper and reimplement it with like maybe a week's worth of work. There's a norm of sometimes providing the actual code that you need to run, but at least providing enough detail that you can reimplement it. I think with some things, it can be tricky, where like, I don't know, our interpretability team just put out a paper on how to get features out of one of our production models. And we didn't release details about our production model. So we tried to include enough detail that someone could replicate this on another model, but they probably can't, they can't exactly like create our production model and like get the exact features that we have. Okay, in a minute, we'll talk about one of the concerns that people might have about working with, working at any AI company. But in the meantime, yeah, what roles are you hiring for at the moment and what roles are likely to be open at Anthropic in future? So probably just check our website. There's like quite a lot. I'll kind of highlight a few. So I think the first one I should highlight is the RSP team is looking for people to develop evaluations, work on the RSP itself, figure out like what the next version of the RSP should look like, et cetera. On my team, we're hiring a bunch of research engineers. So this is: come up with approaches to improve models, implement them, analyze the results, kind of pushing that loop. And then also performance engineers. This one's maybe like a little bit more surprising, but a lot of the work now happens on custom AI chips and making those run really efficiently is sort of absolutely critical. There's a lot of interplay between how fast it can go and how good the model is. So we're sort of hiring quite a number of performance engineers where it's much, you don't need to have a ton of AI expertise, just having like deep knowledge of how hardware works and how to write code really efficiently. How can people learn that skill? Are there courses for that? There are probably courses. I think with basically everything, I would recommend finding a project, finding like someone to mentor you and be cognizant of their time. Maybe you like spend a bunch of time writing up some code and you send them a few hundred lines and say, can you review this and help me?
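One concrete starter exercise in the performance-engineering direction Nick describes: time a large matrix multiply and work out the achieved TFLOP/s, then try to push that number up. A minimal sketch, assuming PyTorch is installed; nothing here is Anthropic-specific, and the sizes are deliberately small enough to run on a laptop CPU.

```python
"""Micro-benchmark: measure achieved TFLOP/s for a dense matmul."""
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
N = 2048  # matrix size; increase on a GPU, keep modest on CPU
a = torch.randn(N, N, device=device, dtype=dtype)
b = torch.randn(N, N, device=device, dtype=dtype)

# Warm up so we measure steady-state throughput, not one-off launch costs.
for _ in range(3):
    a @ b
if device == "cuda":
    torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    c = a @ b
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

flops_per_matmul = 2 * N ** 3  # one multiply and one add per inner-product term
tflops = flops_per_matmul * iters / elapsed / 1e12
print(f"{device}: {elapsed / iters * 1000:.2f} ms per matmul, {tflops:.2f} TFLOP/s achieved")
```

Comparing the measured number against the hardware's advertised peak, and then closing the gap with better data types, tiling, or kernels, is roughly the shape of the "implement it as fast as you possibly can" projects mentioned next.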
Or maybe you've got some weekly meeting where you ask questions. But yeah, I think you can read about it online. You can take courses or you can just pick a project and say, I'm going to implement a transformer as fast as I possibly can and sort of hack on that for a while. Are most people coming into Anthropic from other AI companies or like the tech industry more broadly or from PhDs or maybe not even PhDs? It's quite a mix. I think like PhD is definitely not necessary. I think it's like one direction to go to build up the skill set. We have a shockingly large number of people with physics backgrounds who have done theoretical physics for a long time and then sort of spend some number of months learning the engineering to be able to like write Python really well, essentially, and then switch in. So I think there's not really a particular background that is sort of needed. It's just, I would say if you're directly preparing for it, just pick the closest thing you can to the job and do that to prepare, but don't feel like you needed to have some particular background in order to apply. This question is slightly absurd because there's such a range of different roles that people could potentially apply for at Anthropic. But do you have any kind of any advice for people who would like, the vision for their career is working at Anthropic or something similar, but they don't yet feel like they're qualified to get a role at such a serious organisation? How can they go, what are some interesting underrated paths maybe to gain experience or skills so that they can be more useful to the project in future? Yeah, I would just pick the role you want and then do it externally, do it in a very publicly visible way, get advice, and then apply with that as an example. So like if you want to work on interpretability, make some tooling to pull out features of models and post that on GitHub or publish a paper on interpretability. If you want to work on the RSP, then make a really good evaluation, post it on GitHub with a nice write-up of how to run it and include that with your application. This sort of takes time and it's hard to do well, but I think that it's both the best way to know if it's really the role you want. And when hiring for something, I have a role in mind and I want to know if someone can do it and if someone has shown, look, I'm already doing this role, here's my proof I can do it well, that's the most convincing case. In many ways, more so than the signal you'd get out of an interview where all you really know is they did well on this particular question. So in terms of working at AI companies, regular listeners will recall that earlier in the year I spoke with Zvi Mowshowitz, who's a long-time follower of advances in AI and I'd say is a bit on the pessimistic side about AI safety and maybe also not, I think he likes the Anthropic RSP, but he's not convinced that any of the safety plans put forward by any company or any government are at the end of the day, going to be quite enough to keep us safe from rapidly self-improving AI. And he said that he was pretty strongly against people taking capabilities roles that would kind of push forward the frontier of what the most powerful AI models can do, I guess, especially at leading AI companies. Because I mean, the basic argument is just that those roles are causing a lot of harm because they're speeding things up and leaving us less time to solve whatever kind of safety issues we're going to need to address.
And I pushed back a little bit and he wasn't really convinced by the various kind of justifications that one might give, like the need to gain skills, that you could then apply to safety work later, or maybe you'd have the ability to influence a company's culture by being on the inside rather than the outside. I think of all companies, Zvi, I would certainly imagine, is most sympathetic to Anthropic. But I guess his philosophy is very much to kind of rely on hard constraints rather than put trust in any particular individuals or organizations that you like. I'm guessing that kind of, you might've heard what Zvi had to say in that episode. And I guess it was a critique that arguably applies to your job of training Claude 3 and other frontier LLMs. So I'm kind of fascinated to hear what you thought of Zvi's perspective there. So I think there's like one argument, which is to do this to build.
career capital, and then there's another that is like to do this for direct impact. I think on the career capital one, I'm pretty skeptical. I think career capital is sort of weird to think about in this field that's growing exponentially. I think in a normal field, people often say you have the most impact late in your career. You've built up skills for a while, and then maybe your 40s or 50s is when you have the most impact of your career. But given the rapid growth in this field, I think actually the best moment for impact is now. I don't know. I often think of 2021, when I was at Anthropic. I think there were probably tens of people working on large language models, which I thought were the main path towards AGI. Now there are thousands. I've improved. I've gotten better since then, but I think probably I had way more potential for impact back in 2021 when there were only tens of people working on it. Your best years are behind you, Nick. Yeah. I think the potential was very high. I still think that there's still a lot of room for impact. It will maybe decay, but it's from an extremely high level. And then the thing is just the field isn't that deep. Because it's such a recent development, it's not like you need to learn a lot before you can contribute. I think if you want to do physics and you have to learn the past thousands of years of physics before you can push the frontier, that's a very different setup from where we're kind of at. And maybe my last argument is just like, if you think timelines are short, depending exactly how short, there's just actually not that much time left. So if you think there's five years and you spent two of them building up a skill set, that's a significant fraction of the time. I'm not saying that that should be someone's timeline or anything, but the shorter they are, the less that makes sense. So yeah, I think from a career capital perspective, I'd probably agree. Does that make sense? Yeah. Yeah. And what about from other points of view? Yeah, I think from a direct impact perspective, I'm fairly less convinced. Part of this is just that I don't have this framing of there's capabilities and there is safety, and they are separate tracks that are racing. I think that's one way to look at it, but I actually think they're really intertwined. And a lot of safety work relies on capabilities advances. I can give this example of this many-shot jailbreaking paper that one of our safety teams published, which uses long context models to find a jailbreak that can apply to Claude and other models. And that research was only possible because we had long context models that you could test this on. So I think there's just a lot of cases where the things come together. But then I think if you're going to work on capabilities, you should be really thoughtful about it. I do think there is a risk. You are speeding them up. In some sense, you could be creating something that is really dangerous, but I don't think it's as simple as just don't do it. I think you want to think all the way through to what is the downstream impact when someone trains AGI, and how will you have affected that? And that's a really hard problem to think about. There's a million factors at play, but I think you should think it through, come to your best judgment, and then re-evaluate and get other people's opinions as you go. Some of the things I might suggest doing, if you're considering working on capabilities at some lab, is try to understand their theory of change.
Ask people there, ask, how does your work on capabilities lead to a better outcome? And see if you agree with that. I would talk to their safety team, talk to safety researchers externally, get their take. Say, do they think that this is a good thing to do? And then I would also look at their track record and their governance and all the things to answer the question of, do you think they will push on this theory of change? Over the next five years, are you confident this is what will actually happen? One thing that convinced me at Anthropic that I was maybe not doing evil, or maybe made me feel much better about it, is that our safety team is willing to help out with capabilities and actually wants us to do well at that. Early on with Opus, before we launched it, we had a major fire. There were a bunch of issues that came up, and there was one very critical research project that my team didn't have capacity to push forward. I asked Ethan Perez, who's one of the safety leads at Anthropic, can you help with this? It was actually during an offsite. Ethan and most of his team basically went upstairs to this building in the woods that we had for the offsite and cranked out research on this for the next two weeks. For me, at least, I was like, ah, yes, the safety team here also thinks that us staying on the frontier is critical. The basic idea is you think that the safety research of many different types that Anthropic is doing is very useful. It sets a great example. It's research that could then be adopted by other groups and also used by Anthropic to make safe models. The only way that that can happen, the only reason that that research is possible at all, is that Anthropic has these frontier LLMs on which to experiment and do their research and to be at the cutting edge, generally, of this technology and so able to figure out what's the safety research agenda that is most likely to be relevant in future. If I imagine what would Zvi say? I'm going to try to model him. I guess that he might say, yes, given that there's this competitive dynamic forcing us to shorten timelines, bringing the future forward maybe faster than we feel comfortable with, maybe that's the best you can do, but wouldn't it be great if we could coordinate more in order to buy ourselves more time? I guess that would be one angle. Another angle that I've heard from some people, I don't know whether Zvi would say this or not, is that we're nowhere near actually having all the safety relevant insights that we can have with the models that we have now. Given that there's still such fertile material with Claude 2 maybe, or at least with Claude 3 now, why do you need to go ahead and train Claude 4? Maybe it's true that five years ago, when we were so much further away from having AGI or having models that were really interesting to work with, we were a little bit at a loose end trying to figure out what safety research would be good because we just didn't know what direction things were going to go. But now, there's so much safety research, there's a proliferation, a Cambrian explosion of really valuable work. We don't necessarily need more capable models than what we have now in order to discover really valuable things. What would you say to that? On the first one, I think there's sometimes this, what is the ideal world if everyone was me, or something? If everyone thought what I thought, what would be the ideal setup? And I think that's just not how the world works.
And I think to some degree, you really only can control what you do, and maybe you can influence what a small number of people you talk to do. But I think you have to think about your role in the context of the broader world more or less acting in the way that they're going to act. So yeah, it's definitely a big part of why I think it's important for Anthropic to work on capabilities: to enable safety researchers to have better models. Another piece of it is to enable us to have an impact on the field and try to set this example for other labs that you can deploy models responsibly and do this in a way that doesn't cause catastrophic risks and continues to push on safety. In terms of can we do safety research with current models, I think there is definitely a lot to do. I also think we will target that work better the closer we get to AGI. I think the last year before AGI will definitely have the most targeted safety work. Hopefully there'll be the most safety work happening then, but it will be the most time-constrained. So you need to do work now because there's a bunch of serial time that's needed in order to make progress, but you also want to be ready to make use of the most well-directed time towards the end. I guess another concern that people have, which you touched on earlier, but maybe we could talk about a little bit more, is this worry that Anthropic by existing, by competing with other AI companies, it stokes the arms race, increases the pressure on them feeling that they need to improve their models further, put more money into it, release things as quickly as they can. If I remember, your basic response to that was like, yes, that effect's not zero, but in the scheme of things, there's a lot of pressure on companies to be training models and trying to improve them. Anthropic is kind of a drop in the bucket there, and so this isn't necessarily the most important thing to be worrying about. Yeah, I think basically that's pretty accurate. I think one way I would think about it is, what would happen if Anthropic stopped existing? If we all just disappeared, what effect would that have in the world? Or if you think about if we dissolved as a company and everyone went to work at all the others. My guess is it just wouldn't look like everyone slows down and is way more cautious. That's not my model of it. If that was my model, then I would be like, ah, we're probably doing something wrong. I think it's an effect, but I think I think about it in terms of what is the net effect of Anthropic being on the frontier when you account for all the different actions we're taking, all the safety research, all the policy advocacy, the effect our products have helping users. There's this whole large scheme and you can't really add it all up and subtract the costs, but I think you can do that somewhat in your mind or something. Yeah, I see. The way you conceptualize it is thinking, Anthropic as a whole, what impact is it having by existing compared to some counterfactual where Anthropic wasn't there? And then you're contributing to this broader enterprise that is Anthropic and all of its projects and plans together, rather than thinking about, today I got up and I helped to improve Claude 3 in this narrow way. What impact does that specifically have? Because it's missing the real effects that matter the most from allowing this organization to exist through your work. Yeah, you could definitely think on the margin.
To some degree, if you're joining and going to help with something, you are just increasing Anthropic's marginal amount of capabilities. And then I would just look at, do you think we would be on a better trajectory if Anthropic had better models? And do you think we'd be on a worse trajectory if Anthropic had significantly worse models? That would be the comparison. I think you could look at that.
like, oh, well, what would happen if Anthropic didn't ship Claude 3 earlier this year? What are some of the lines of research that you're most pleased that you've helped Anthropic to pursue? What are some of the kind of safety wins that you're really pleased by? I'm really excited about the safety work. I think there's just like a ton of it that has come out of Anthropic. I can sort of start with interpretability where I think at the beginning of Anthropic, it was like figuring out how single layer transformers work, these sort of very simple toy models. And in sort of the past few years, this is not my doing, this is all the interpretability team, that has sort of scaled up into actually being able to look at production models that people are really using and find useful and identify particular features. We sort of had this recent one on the Golden Gate Bridge where they found a feature that is the model's representation of the Golden Gate Bridge. And if you increase it, the model talks more about the Golden Gate Bridge. And that's sort of like a very cool causal effect where you can change something and it actually changes the model behavior in a way that gives you more certainty you've really found something. I'm not sure whether all listeners will have seen this, but it is very funny because its mind is constantly turned to thinking about the Golden Gate Bridge, even when the question has nothing to do with it. And it gets frustrated with itself, realizing that it's going off topic and then tries to bring it back to the thing that you asked, but then it just can't avoid talking about the Golden Gate Bridge again. I hope that you could find the honesty part of the model and scale that up enormously. Or alternatively, find the deception part and scale that down in the same way. Yeah, so there actually are a bunch. If you look at the paper, there's a bunch of safety-relevant features. I think that the Golden Gate Bridge one was cuter or something and got a bit more attention. But yeah, there are a ton of features that are really safety-relevant. I think one of my favorites was one that will tell you if code is incorrect or something, or has a vulnerability, something along those lines. And then you can change that and suddenly it doesn't write the vulnerability or it makes the code correct. And that sort of shows the model knows about concepts at that level. Now, can we use this directly to solve major issues? Probably not yet, right? There's a lot more work to be done here. But I think it's just been a huge amount of progress. And I think that it's fair to say that progress wouldn't have happened without Anthropic's interpretability team pushing that field forward a lot. Yeah. Is there any other Anthropic research that you're proud of? Yeah. I mentioned this one a little bit earlier, but there's this many-shot jailbreaking from our alignment team that showed that if you have a long context model, which is something that we released, you can jailbreak a model by just giving it a lot of examples in this very long context. And it's a very reliable jailbreak to get models to do things you don't want. So this is in the vein of the RSP. One of the things we want to have is to be able to be robust to really intense red teaming, where if a model has a dangerous capability, you can have safety features that prevent people from eliciting it. And this is an identification of a major risk for that.
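For readers who want to see the causal-intervention idea behind the Golden Gate Bridge result in code: the sketch below adds a direction to the residual stream of a small open model and watches the output shift. Note the assumptions. Anthropic's result uses features learned by a sparse autoencoder on a production model; this sketch instead uses the much simpler "difference of mean activations" steering-vector trick on GPT-2, so it illustrates the mechanism, not their method, and the layer choice and steering strength are arbitrary.

```python
"""Toy activation steering on GPT-2 (requires torch and transformers)."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
LAYER = 6      # which transformer block's residual stream to steer
ALPHA = 8.0    # steering strength; needs tuning in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def mean_activation(prompts):
    """Average last-token hidden state at LAYER over a few prompts."""
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of block LAYER (index 0 is the embeddings).
        acts.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(acts).mean(dim=0)


# Steering direction: concept prompts minus neutral prompts, then normalised.
concept = ["The Golden Gate Bridge is", "Driving across the Golden Gate Bridge"]
neutral = ["The weather today is", "My favourite recipe is"]
direction = mean_activation(concept) - mean_activation(neutral)
direction = direction / direction.norm()


def steer(module, inputs, output):
    """Forward hook: add the steering direction to every position's residual stream."""
    hidden = output[0] + ALPHA * direction
    return (hidden,) + output[1:]


handle = model.transformer.h[LAYER].register_forward_hook(steer)
try:
    ids = tokenizer("Tell me about your weekend plans.", return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later calls are unmodified
```

The interesting property Nick points to is that the intervention is causal: you change one internal quantity and the behaviour changes accordingly, which is stronger evidence than a purely correlational probe.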
We also have this sleeper agents paper, which shows the early signs of models having deceptive behavior. Yeah, I could talk about a lot more of it. There's actually just a really huge amount. And I think that's fairly critical here. I think often with safety things, people get focused on inputs and not outputs or something. And I think the important thing is to think about how much progress are we actually making on the safety front? That is ultimately what's going to matter in some number of years when we get close to AGI. It won't be how many GPUs did we use, how many people worked on it. It's going to be what did we find and how effective were we at it. And for product, this is very natural. People think in terms of revenue. How many users did you get? You have these end metrics that are the fundamental thing you care about. And I think for safety, it's much fuzzier and harder to measure. But putting out a lot of papers that are good is quite important. Yeah. I mean, if you want to keep going, if there's any others that you want to flag, I'm in no hurry. Yeah. I mean, talking about influence functions, I think this is a really cool one. So one framing of mechanistic interpretability is it lets us look at the weights and understand why a model has a behavior by looking at a particular weight. The idea of influence functions is to understand why a model has a behavior by looking at the training data. So you can understand what in your training data contributed to a particular behavior from the model. Yeah. I think that was pretty exciting to see work. I think constitutional AI is another example I would highlight where we can train a model to follow a set of principles via sort of AI feedback. So instead of having to sort of have human feedback for a bunch of things, you can just write out a set of principles. I want the model to not do this, I want it to not do this, I want it to not do this, and train the model to follow that constitution. Is there any work at Anthropic that you personally would be wary of or at least not enthusiastic to contribute to? So I think in general, this is a good question to ask. I think the work I'm doing is currently the highest impact thing. And I think I should frequently wonder if that's the case and talk to people and reassess. I think right now, I don't think there's any work at Anthropic that I wouldn't contribute to or think shouldn't be done. And that's probably not the way I would approach it anyway. If there was something that I thought Anthropic was doing that was bad for the world, I would write a doc making my case and send it to the relevant person who's responsible for that, and then have a discussion with them about it. Because just opting out isn't going to actually change it. Someone else will just do it. That doesn't accomplish much. And we try to sort of operate as one team where everyone is aiming towards the same goals and not have this sort of two different teams are at odds where you're hoping someone else won't succeed. I guess people might have a reasonable sense of the culture at Anthropic just from listening to this interview. But is there anything else that's interesting about working at Anthropic that might not be immediately obvious? I think the one thing that is part of our culture that at least surprised me is spending a lot of time pair programming. This is just a very collaborative culture. So I think when I first joined, I don't know, I was working on a particular method of distributing language model training across a bunch of GPUs.
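A brief aside on the influence-functions idea Nick mentions above, before the pair-programming story continues. The classical formulation (Koh and Liang, 2017) estimates how up-weighting a training example $z$ would change the loss on a query $z_{\text{test}}$, without retraining:

$$
\mathcal{I}(z, z_{\text{test}}) \;=\; -\,\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta}).
$$

Training examples with large positive scores pushed the model toward the behaviour at $z_{\text{test}}$; large negative scores pushed against it. Inverting the Hessian is intractable at LLM scale, so published work in this area relies on approximations (Anthropic's paper reportedly uses EK-FAC); the exact formula above is the textbook version rather than a description of their production setup.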
And Tom Brown, who's one of the founders and had done this for GPT-3, just put an eight-hour meeting on my calendar. And I just watched him code it. And then I was in a different time zone. So basically during the hours when he wasn't working and I was working, I would push forward as far as I could. And then the next day we would meet again and continue on. And I think it's just a really good way of aligning people where it's a shared project instead of you're bothering someone by asking for their help. It's like you're working together on the thing and you learn a lot. You also learn a lot of the smaller things that you wouldn't otherwise see. Like how does someone navigate their code editor? What exactly is their style of debugging this sort of problem? Whereas if you go and ask them for advice or like, how do I do this project? They're not going to tell you the low level details. When do they pull out a debugger versus some other tool for solving the problem? So this is literally just watching one another's screens or you're doing a screen share thing where you watch? Yeah. I'll give some free advertising to Tuple, which is this great software for it where you can share screens and you can control each other's screens and draw on the screen. And typically one person will drive, they'll be basically doing the work and the other person will watch, ask questions, point out mistakes, occasionally grab the cursor and just change it. It's interesting that I feel in other industries having your boss or a colleague stare constantly at your screen would give people the creeps or they would hate it. Whereas it seems like in programming, this is something that people are really excited by and they feel like it enhances their productivity and makes the work a lot more fun. Oh yeah. I mean, it can be exhausting and tiring. I think the first time I did this, I was too nervous to take a bathroom break. And after multiple hours, I was like, can I go to the bathroom? And I realized that was an absurd thing to ask after multiple hours of working on something. It can definitely feel a little bit more intense in that someone's watching you and they might give you feedback on like, ah, you're kind of going slow here. This sort of thing would speed you up. But I think you really can learn a lot from that intensive partnering with someone. All right. I think we've talked about Anthropic for a while. I guess the final question is, so obviously Anthropic, its main office is in San Francisco, right? And I heard that it was opening a branch in London. Are those the two main places? And are there many people who work remotely or anything like that? Yeah. So we have the main office in SF and then we have offices in London, Dublin, I think, Seattle and New York. Our typical policy is like 25% time in person. So some people will mostly work remotely and then go to sort of one of the hubs for, it's usually one week per month. The idea of this is that you should, we want people to like build trust with each other and like be able to work together well and like know each other. And that sort of involves some amount of social interaction with your coworkers, but also, for a variety of reasons, sometimes the best people are bound to particular locations. I kind of have been assuming that all of the main AI companies are probably hiring hand over fist.
And then I guess I know, um, Anthropic's received like big investment from Amazon, maybe some other folks as well, but is it, yeah. Does it feel like the organization is growing a lot that there's lots of new people around all the time? Yeah. Growth has been very rapid. We recently moved into a new office. Before that we were, uh, we'd run out of desks, which was an interesting moment for, for the company. It was, it was very crammed. Now there's space. Uh, yeah. I mean, rapid growth is a very like difficult challenge, but also a very interesting one to work on. I think that's like to some degree, what I spend a lot of my time thinking about is just how can, how can we grow, grow the team and be able to maintain this sort of linear growth and productivity is sort of the dream, right? If you double the number of people, you get twice as much done and you never actually hit that, but it's, it takes a lot of work because there's now all this communication overhead and you have to do a bunch to make sure everyone's like working towards the same goals, sort of maintain the culture that we currently have. I've given you a lot of time to talk about what's great about Anthropic, but I should at least ask you what's, what's, what's kind of worst about Anthropic? What would you most like to see, to see improve? I think honestly, the first thing that comes to mind is just the like stakes of what we're working on or something. I think that there was sort of a period a few years ago where I felt like, ah, safety is really important. I felt like motivated.
It was a thing I should do and got value out of it, but I didn't feel this sort of, oh, it could be really urgent. Decisions I'm making are just really, really high-stakes decisions. I think Anthropic definitely feels high-stakes. I think it's often portrayed as this doomy culture. I don't think it's that. I think there are a lot of benefits. I think I'm pretty excited about the work I'm doing, and it's quite fun on a day-to-day basis. But it does feel very high-intensity, and so many of these decisions, they really do matter. If you really think we're going to have the biggest technological change ever, and how well that goes depends in a large part on how well you do at your job on that given day. No pressure. Yeah. And the timelines are really fast, too. Even commercially, you can kind of see that it's months between major releases, and that puts a lot of pressure, where if you're trying to keep up with the pace of AI progress, it is quite difficult, and it relies on success on very short timelines. Yeah. So for someone who has relevant skills, might be a good employee, but maybe they struggle to operate at super-high productivity, super-high energy all the time, could that be an issue for them at a place like Anthropic, where it sounds like there's a lot of pressure to deliver all the time? I guess, potentially internally, but also just the external pressures are pretty substantial. Yeah, I think that some part of me wants to say yes. It's really important to be very high-performing a lot of the time. The standard of always do everything perfect all of the time is not something anyone meets, and I think it is important sometimes to just keep in mind that all you can do is your best effort. So we will mess things up, and even if it's high stakes and that's quite unfortunate, it's unavoidable. No one is perfect, so I wouldn't set too high of a bar of "I couldn't possibly handle that." I think people really can, and you can grow into that and get used to that level of pressure and how to operate under it. Alright, I guess we should wrap up. We've been at this for a couple of hours, but I'm curious to know what is an AI application that you think is overrated and maybe going to take longer to arrive than people expect? Maybe what's an application that you think might be underrated and consumers might be really getting a lot of value out of surprisingly soon? Overrated? I think with overrated, people are often like, oh, I should never, I'll never have to use Google again, or it's a great way to get information. I find that I still, if I just have a simple question and I want to know the answer, just Googling it will give me the answer quickly, and it's almost always right. Whereas if I go ask Claude, it'll sample it out and I'll be like, is it true? Is it not true? It's probably true, but it's in this conversational tone. I think that's one that doesn't yet feel like one of the strengths. The place where I find the most benefit is coding. I think this is not a super generalizable case or something, but if you're ever writing software, or if you've thought I don't know how to write software, but I wish I did, the models are really quite good at it. If you can get yourself set up, you can probably just write something out in English and it will spit out the code to do the thing you need rather quickly. And then the other thing is problems where I don't know what I would search for. I have some question, I want to know the answer, but it relies on a lot of context. It would be this giant query.
Models are really good at that. You can give them documents. You can give them huge amounts of stuff and explain really precisely what you want, and then they will interpret it and give you an answer that accounts for all the information you've given them. I think I do use it mostly as a substitute for Google, but not for simple queries. It's more like something kind of complicated where I feel like I'd have to dig into some articles to figure out the answer. I think one that jumps to mind is: Francisco Franco was kind of on the side of the Nazis during World War II, but then he was in power for another 30 years. Did he ever comment on that? What did he say about the Nazis later on? And I think Claude was able to give me an accurate answer to that, whereas I probably could have spent hours maybe trying to look into that, trying to find something. The answer is he mostly just didn't talk about it. My other favorite one, which is a super tiny use case, is if I ever have to format something and do something, like if there's just some giant list of numbers that someone sent me in a Slack thread and it's bulleted and I want to add them up, I can just copy-paste it into Claude and say, add the things up. It's very good at taking this weird thing, kind of structuring it, and then doing a simple operation. All of these models are really good at programming. I've never programmed before, really, and I've thought about maybe I could use them to make something of use. But I guess I'm at such a basic level, I don't even know where to start, so I would get the code and then where would I run it? Is there a place that I can look this up? I think you basically want to just look up, I would suggest Python, an introduction to Python and get your environment set up. You'll eventually run Python and then some file name, and you'll hit enter and that will run the code. That part's annoying. Claude could help you if you run into issues setting it up. But once you have it set up, you can just be like, write me some code to do x and it will write that not perfectly, but pretty accurately. I guess I should just ask Claude for guidance on this as well. I've got a kid who's a couple of months old. I guess in three or four years' time they'll be going to preschool and then eventually starting reception, primary school. I guess my hope is that by that time, AI models might be really involved in the education process and kids will be able to get a lot more, well, maybe it would be very difficult to keep a five-year-old focused on the task of talking to an LLM, but I would think that we're close to being able to have a lot more individualized attention from educators, even if those educators are AI models. This might enable kids to learn a lot faster than they can when there's only one teacher split between 20 students or something like that. Do you think that kind of stuff will come in time for my kid first going to school or might it take a bit longer than that? I can't be sure, but yeah, I think there will be some pretty major changes by the time your kid is going to school. Okay, yeah, that's good. That's one that I really don't want to miss on the timeline. We can... I'm worried, like Nathan Labenz, I'm worried about hyperscaling, but on a lot of these applications I really just want them to reach us as soon as possible because they do seem so useful. My guest today has been Nick Joseph. Thanks so much for coming on the 80,000 Hours Podcast, Nick. Thank you.
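To make the "get your environment set up, then run the file" workflow Nick describes concrete: the script below is the sort of tiny program the "paste a bulleted list and add it up" use case produces. It is an editorial example, not output from Claude; the regular expression and file name are arbitrary choices.

```python
"""sum_list.py: paste a messy bulleted list of numbers, get the total.

Run with:  python sum_list.py
Then paste the text and press Ctrl-D (Ctrl-Z followed by Enter on Windows).
"""
import re
import sys

text = sys.stdin.read()
# Pull every number out of the pasted text, ignoring bullets and other noise.
numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]
print(f"{len(numbers)} numbers found, total = {sum(numbers)}")
```

Saving that as a file and invoking it with `python sum_list.py` is exactly the "run Python and then some file name, and hit enter" step described above.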
If you're really interested in the pretty vexed question of whether all things considered it's good or bad to work at the top AI companies if you want to make the transition to superhuman AI go well, our researcher Adam Kaler has just published a new article on exactly that titled, Should You Work at a Frontier AI Company? You can find that by Googling 80,000 Hours and Should You Work at a Frontier AI Company or heading to our website 80,000Hours.org and just looking through our research. And finally, before we go, just a reminder that we are hiring for two new senior roles at 80,000 Hours, a head of video and a head of marketing, and you can learn more about both of those at 80,000Hours.org. Those roles would probably be done in our offices in central London, but we are open to exceptional remote candidates in some cases, and alternatively, if you're not in the UK, but you would like to be, we can also support UK visa applications. The salaries for these two roles would vary depending on seniority, but someone with five years of relevant experience would be paid approximately £80,000, something like that. The first of these two roles, the head of video, would be someone in charge of setting up a whole new video product for 80,000 Hours. Obviously, people are spending a larger and larger fraction of their time online watching videos on video-specific platforms, and we want to explain our ideas there in a compelling way that can reach the sorts of people who care about them. That video program could take a range of forms, including 15-minute direct-to-camera vlogs, lots and lots of one-minute videos, maybe 10-minute explainers, that's probably my favourite YouTube format, or alternatively, lengthy video essays, some people really like those. The best format would be something for this new head of video to look into and figure out for us. We're also looking for a new head of marketing to lead our marketing efforts to reach our target audience at a large scale. They're going to be setting and executing on a strategy, managing and building a team, and ultimately deploying our yearly marketing budget of around $3 million. We currently run sponsorships on major podcasts and YouTube channels, hopefully you've seen some of them. We also do targeted ads on a range of social media platforms, and collectively that's gotten hundreds of thousands of new people onto our email newsletter. We also mail out a copy of one of our books about high-impact career choice every eight minutes, that's what I'm told, so there's certainly the potential to reach many people if you're doing that job well. Applications will close in late August, so please don't delay if you'd like to apply for those ones. 80thousandhours.org slash latest. All right, the 80,000 Hours Podcast is produced and edited by Keiran Harris, audio engineering by Ben Cordell, Milo McGuire, Simon Monsour and Dominic Armstrong. Full transcripts and an extensive collection of links to learn more are available on our site and put together, as always, by the legend herself, Katy Moore. Thanks for joining, talk to you again soon.

Related conversations

AXRP

15 Feb 2026

Guive Assadi on AI Property Rights

This conversation examines governance through Guive Assadi on AI Property Rights, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Med 0 · avg -1 · 136 segs

AXRP

28 Jun 2025

Peter Salib on AI Rights for Human Safety

This conversation examines governance through Peter Salib on AI Rights for Human Safety, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Med 0 · avg -3 · 196 segs

AXRP

27 Nov 2023

AI Governance with Elizabeth Seger

This conversation examines governance through AI Governance with Elizabeth Seger, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Med -7 · avg -8 · 110 segs

AXRP

28 Jul 2024

AI Evaluations with Beth Barnes

This conversation examines technical alignment through AI Evaluations with Beth Barnes, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Med 0 · avg -4 · 120 segs

Counterbalance on this topic

Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.