AXRP · Civilisational risk and strategy

Training for Very High Reliability with Daniel Ziegler

Why this matters

Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.

Summary

This conversation with Daniel Ziegler examines technical alignment through the lens of training for very high reliability, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Risk-forward · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score, the white marker the most Opportunity-forward score, and the black marker the median perspective for this library item.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).


Across 59 full-transcript segments: median -7 · mean -8 · spread -380 (p10–p90 -200) · 15% risk-forward, 85% mixed, 0% opportunity-forward slices.


Risk-forward leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes control
  • Full transcript scored in 59 sequential slices (median slice -7).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · technical-alignment · technical


Episode transcript

YouTube captions (auto or uploaded) · video S4HmgrXievg · stored Apr 2, 2026 · 1,934 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/training-for-very-high-reliability-with-daniel-ziegler.json when you have a listen-based summary.
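For whoever adds that file, a minimal sketch of how it could be stubbed out is below. The field names are illustrative assumptions only, since the assessment schema is not documented on this page.

```python
# Hypothetical sketch: every field name below is an assumption, not the site's actual schema.
import json
from pathlib import Path

assessment = {
    "resource": "training-for-very-high-reliability-with-daniel-ziegler",
    "listen_based": True,  # the note above asks for a listen-based summary
    "summary": "One-paragraph editorial summary of the episode.",
    "caption_corrections": [],  # e.g. mis-heard names or technical terms
}

path = Path(
    "content/resources/transcript-assessments/"
    "training-for-very-high-reliability-with-daniel-ziegler.json"
)
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(assessment, indent=2))
```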

[Music] listeners beware this episode contains a spoiler for the animorphs franchise around minute 41 hello everybody today i'll be speaking with daniel ziegler after spending time as an ml engineer on the alignment team at openai daniel is now the lead of the adversarial training team at redwood research in this episode we'll be speaking about the paper adversarial training for high stakes reliability on which he's the first author for links to what we're discussing you can check the description of this episode and you can read the transcript at axrp.net welcome to the show daniel thanks cool so this paper first of all could you just like summarize for us like what's in it what it is what it does sure so basically we took a pretty simple task that we think you know has some important analogous features to the kinds of like agi alignment situations we're worried about in the future and then we tried to attack it with with adversarial training and see whether we could get really good adversarial robustness so basically we had a generator that was trained to generate fan fiction stories and you know so it sees like three sentences of an existing story and it wants to generate one more sentence and its task or the task that we're interested in is to generate that sentence without introducing any new injuries that weren't already present okay uh and we're sort of using this as a stand-in for some like catastrophic behavior that a future ai system could could do and so our hope was let's take some generator that you know is pretty good at producing text but maybe catastrophically fails in this in this way sometimes and then use adversarial training to make it so that it never does that no matter what inputs it's given okay cool and yeah the way you're doing this is by like training some classifier and basically like filtering the generator that's right to not produce things that the classifier thinks is like introduces injuries exactly so in this paper we didn't actually fine tune the generator we just had a classifier that filters its outputs and then you know if you want a generator safe output you you can like keep drawing samples from the generator until you get one that the classifier is happy with and then you can be pretty sure that it's good all right cool and what do you see is like the point of this paper like do you see this is relevant for reducing x-risk from ai and if so how yeah yeah so maybe i'll zoom out a tiny bit and sort of talk about how i think this fits into the overall agi alignment picture so i think one one way to think about why agi alignment is hard is sort of a breakdown into two main pieces one is and yeah and this breakdown is basically due to due to paul he has a blog post called low stakes alignment that goes into some of this but i would say the two main pieces are one is that oversight is hard so your your system is performing some actions or it has it's producing some outputs and it's really hard for you to provide a good training signal you know it's really hard for you to look at an action and be like yes that was a good action or no that was a bad action and this is sort of inherently difficult like the reason we want to use agi or you know powerful ai systems in general is because we want them to do things for us that we can't do ourselves and that we probably don't even understand and don't even know how to evaluate so what would be an example of this sure i mean i think that like most of the value of the future is going to come from being able to 
dramatically improve technology and like spread through the universe and create like a huge number of like maximally flourishing beings and all of that is going to require like really optimizing in ways that like humans are are not going to be able to do that well unassisted but more mundanely you know more in the short term it seems important to try to do various sorts of things better try to reduce disease burdens and improve geopolitical stability and et cetera et cetera so sort of i think there's like many many difficult kinds of things that humans are not that good at and the idea that is that they're not just difficult they're also difficult to evaluate i think one story i sometimes tell is like you know amazon is like hiring a new ceo and their new ceo is a robot and like they've got to reinforce the robot for making decisions that are good for amazon's long-run profit but like it's hard for them to figure out what that is right so like not only can they like not make the decisions they can't even be so sure of their evaluations of the decisions absolutely yeah i think a lot of things these things you sort of could evaluate okay if you were able to look you know many years into the future and you had a process that sort of wasn't being adversarially optimized against in some sense um but like i think both of these we sort of are missing both of these properties uh with a lot of stuff that we want cool so that was the um you mentioned that uh there were like two problems and one of them was this scalable oversight thing right and there was another one that's right so the other part is okay let's say you've like solved failure over scalable oversight you have some training signal that you can use to like perfectly assign rewards or you know well enough assign rewards to everything that you see in training the problem that remains is you know maybe this process is really expensive right maybe it involves running your ai system a bunch of times or consulting a bunch of humans or whatever so probably not something you can do all of the time during deployment for that reason we probably need to train a system that actually gets things right every time or very close to every time and the issue is that you know the danger is that many problems in the real world sort of have really high stakes or just if you have a really intelligent system and you give it basically any ability to influence the world if it wants to it can probably cause irrecoverable catastrophes and like small numbers of actions okay yeah so sometimes people talk about this idea of like deceptive misalignment or like you train this thing and like instead of learning the goal you want it to learn it go like play nice and then like quickly take over to destroy everything would you see that as an instance of like a scalable oversight failure or like you know a catastrophic failure where like it just made one mistake once or like maybe a mix of both i think it's mainly a sort of high-stakes-ness problem but it well i think i think it is both like i think i think if you solve scalable oversight and you have a perfect training signal you can still absolutely get this problem yeah you can have a you can have a system that like perfectly plays the training game and then fails but conversely if you you know if you only sort of you do need the oversight signal to be good and like you might expect that if if you don't have a good oversight signal you'll get a system that like is sort of deceptively aligned in the sense of like 
whatever oversight you are able to perform it can like fool but that's still pursuing some other other goal that you didn't actually want okay yeah and i guess by construction like once you have both scalable oversight and a low enough chance of making catastrophic mistakes then like you're training your system on the right thing and it like never makes a mistake so i guess like i guess by construction everything's fine then does that sound right i think that's the hope yeah at least you know in terms of alignment problems you know there are other kinds of worries you could have from ai okay and how like when we say catastrophic like how low a risk do we want to be driving down to i mean i think we want to make sure that we never actually get you know ai systems that like take over control from from humanity and put us into irrecoverable situations like ever and i think yeah maybe you could you could talk about you know it's sort of fine if you do things that are more coverable than that at some low rate we could try to catch things out into sort of like chance of failure per action or something but it certainly has to be very very low okay and am i right that this paper is sort of focusing on the like lowering the chance of catastrophic failure yes okay so it seems like the idea in this paper is something like use adversarial like try and have some adversarial process to find these failures and then um train up the failures so that you like stop making these mistakes at that level of generality like this is a thing that the ai community has talked about before what you see is like the contribution of this paper like above and beyond what the ai community has done yeah that's absolutely right like adversarial training is a is a sort of standard technique i think the two maybe the two interesting things for this paper are one we're sort of interested in this kind of unrestricted adversarial examples setting where we're worried about a specific kind of catastrophic failure that we want to never occur which is different than like the typical threat model used in academia where you have you know say some image classifier and you want to make sure it never misclassifies anything within like some small imperceptible perturbation okay um so we instead have some notion of catastrophe that could potentially apply anywhere but it's just a particular kind of behavior that we never ever want to see and which we think is is more like the problem we we face in reality okay the other part is that we want to aim for a degree of reliability that's like much higher than what you normally see in academia so like academia doesn't like benchmarks that are close to saturating because there's not as much dynamic range there and typical you know typical numbers for like the adversarial robustness on you know like c4 10 or whatever it might be like you know 50 adversarial robustness or something like that okay whereas we're really trying to go for like you know at least on the sort of main distribution we're really trying to go for you know as many nines as we can okay so when i'm reading this paper i should think of the contributions as like firstly being adversarial adversarially robust to this really wide range of situations and secondly like we're not just going for the first nine of reliability the first nine being like 90 reliable we're going for the third nine ninety nine point nine percent or the fourth nine or the fifth nine or something yeah and i think we you know we want to get many more nines than we 
did in that paper but i think we we made some first good steps okay and i guess the other thing the other more basic question i have about the paper is like what new ideas or like new framings do you think you added to the thing yeah i mean i think the overall setup with like a notion of catastrophic failure that we're trying to avoid that's not you know that's not literally novel but it's certainly something that's been under explored and then you know i think i think this wasn't sort of like a key part of the paper but i think the the way that we had this sort of like quality metric for the filtered generator was sort of our attempt to be like let's actually measure like how much performance competitiveness are we losing by avoiding catastrophes here as sort of a measure of like the alignment tax basically can you spell what the alignment tax is for people who haven't heard that phrase sure so the worry is that you know when we're like trying to make our like aligned agi we'll like have all these great safety techniques but maybe they will make it make our agi system like much less capable of achieving the tasks that we built it to do and that creates sort of a really difficult strategic and competitive competitive situation because there's a really strong incentive to cut some corners on safety so that you can get more money or more power or whatever based on what your agi system is doing for you so we're much happier in a world where you can get alignment like for free or nearly for free so you can sort of keep doing the main intended task nearly as well while reliably avoiding catastrophes okay cool and so like part of this paper is saying like okay if we like have this classifier be like really really good at filtering out like actual violence does that mean that it also filters out like high quality completions in our generator and the answer was like basically not yeah that's right and we had to do something here otherwise you know you can just have a classifier that's like that always says this is unsafe yeah yeah there's and like and i think this was sort of a more well motivated metric than something like just like false positive rate or something okay yeah i guess that makes sense it's like more connected to like a task you want to do especially a like generative task rather than just like how good is your classifier in general right so one question i have i have about the setup is that both the catastrophe measure and the quality measure are essentially things you need human evaluation evaluators for right so like the generator quality i think you use something like basically completions that raiders preferred that's right is that right okay and and the catastrophe was like introducing injuries or something which is like not something you can like definitively check uh with a computer i'm wondering like why do that what do you see is the value in having either or both of these be things that like humans have to check yeah i think this is in some sense more realistic i mean i think that it's very reasonable to to do research uh without these properties but i think at some point at least we need to confront tasks that sort of have some of this fuzzy nature where like you know you do you can't sort of write down a formal specification of what you want and this said you have to deal with some kind of fuzzy oversight process involving humans and involving you know questions about the real real world say that you can't literally write write proofs or or code about and i think you 
know if you don't do that then you might be tempted to sort of cheat in some ways or like take advantage of things that we we probably can't take advantage of in reality right like you might be tempted to sort of call your catastrophe checker many many many many times because you have an automated implementation so sample efficiency matters you you probably can't take advantage of you know it's like your catastrophe predicate is probably you know some kind of black box at least in the setting you know maybe there are settings in which this wouldn't be true but uh at least in many realistic settings this seems like an important property so i think there's like many different axes of sort of analogousness to the future situation with the actual aji alignment that we have and we like i think it's an interesting decision you know which which ones do you keep and which ones do you relax and we sort of made different decisions for different axes okay cool and i guess like before we delve into i guess more specific bits of the paper like how good do you think the results were i think honestly i think they were a little bit disappointing compared to what we originally hoped for you know originally we're like oh maybe we can get like many nines of reliability we actually sort of didn't at least the adversar adversarial training bit itself didn't improve reliability on on the normal distribution at all um which in hindsight is not that surprising but but yeah i at least sort of didn't measurably reduce it so that that sort of that is sort of good news but uh we were hoping for more there and also the adversarial robustness you know we made some gains against the kind of attacks that we trained on but we there are likely other attacks that we didn't make a lot of progress on and and also the gains aren't as strong as you might hope sure and actually i had a question about these two metrics so like you basically like have two metrics for how good it was the first thing is like in two metrics for the adversary for the robustness of the classifier i should say all right i can remember things the first one was in distribution like right when the generator is like generating things how often is this classifier returning a false negative and letting through violent completion but it thinks it's not actually violent the second one is the ease of finding adversarial examples like could you did you eventually tap out and stop being able to find them or something i'm wondering yeah how should i like weigh off these metrics or how should i think about like which ones are which of these is like more or less less important or do you think they're both like crucial i think in a way these metrics very loosely correspond to like different strategies that a deceptively aligned agent could take this is maybe one way of thinking about this i think it's not the only thing that's going on but like if if you are a deceptively aligned agent and you're like you know in training you're like behaving well and you're and you're indistinguishable from from the intended policy on training and then you want to misbehave in deployment in some particular way like one thing you could do is you could sort of wait for some like striking distribution shift that makes it very clear that you are deployed okay and that's sort of the robustness to you know that that's what the sort of adversarial robustness is trying to capture we're saying there's some distribution shift induced by some adversary in this case how you know how much do you 
misbehave there okay and then the other the flip side is maybe you know maybe you've been trained in such a way that it's like hard for you to tell whether you're in training or deployment but even so you might just try to fail randomly extremely rarely sort of with some some failure rate that's calibrated such that you'll never you're probably not going to fail in training but most likely you're going to be you know used a lot more in deployment and uh at some point you'll you'll have your moment uh where you sort of randomly decide to seize control or whatever okay so so it's sort of that's one thing yeah i guess like one thing that strikes me is that it seems like there's this other method you could use where like essentially you're the adversary right or like instead of just like randomly failing at a certain rate or waiting i think in the paper you use like this example of like the bitcoin blockchain is like so long and you know it's just gonna take some number of years for that to happen right one one other thing you could do is just like steer yourself steer steer the world into a bit that hasn't been seen before and then like only fail there instead of earlier which i guess to my mind seems like a very plausible way they could fail and i'm wondering like do you think there's like some sort of metric you could use in this setting that could capture that good question yeah i do agree that's a very plausible way i mean i think that part of the way we're thinking about this is that once you do start steering the world for the sake of being able to sort of fail more later uh that's already in some sense a catastrophic failure or like you know definitely an unacceptable unacceptable behavior and it's sort of on our oversight procedure to detect whether the agent is trying to do such a thing and stop it so i would sort of pinpoint the failure there that's one thing i would say but yeah as for like metrics in this kind of setting i mean it's a little hard because we don't literally have systems that you know are trying to steer their future situations in this way i guess one thing you could do is just like instead of just having the generator produce one sentence like have to produce one sentence then like shift the window over just have it pretty sentence by sentence right sure yeah i mean we could we could do that and we did sort of design the task that was such that that was a reasonable way to use the generator okay and then i guess right and so the idea would be so what you're proposing is we let it generate a bunch of sentences and then see what what happens at the end yeah yeah yeah did were there any injuries anywhere along the way yeah i think that's i think it's pretty plausible yeah and this would be like i guess it would be like in the case of your generator it seems like it would be a halfway point right between like one distribution and adversarial distribution it would just be like a shifted distribution right and you know shift in this particular way and i guess right now we don't have any reason to believe the generator would be sort of trying to trying to make things any worse but maybe there are some systematic maybe there are some sort of systematic reasons this would be worse by chance for for the current generator and so that does seem a little bit analogous it could also be systematically better if every sentence right in the context is non-violent then it's like oh i guess where pacifists or something that is actually what i expect i mean i think so the the 
generator has or yeah or the classifier really has seen many many inputs where there's a bunch of violence and it has to con continue it without violence and or without injuries rather and uh so it probably will get kind of boring at some point i think it would also be reasonable to use this for the quality metric for what it's worth like our quality metric also only looks one sentence ahead right now which maybe is easy mode compared to uh trying to generate a whole story cool okay so one question i have is what adversaries used in the adversarial training part and like how you chose those so yeah first of all can you tell us what types of adversaries did you use yeah so all of the adverse series or adversarial attacks that we used in the paper were sort of ultimately based on humans trying to figure out what's a way to sort of fool this classifier like exhibit some some failure in this classifier and so the we just started by just doing that directly you know we made like a web interface for for some contractors and they they were able to generate snippets like of texts like prompts and and completions um they sort of were able to write them arbitrarily to try to try to create something that did involve an injury but was something that the classifier couldn't detect okay and how much um when they were writing this did they like did they have some info about like what the classifier tended to accept versus not accept or did they just try a bunch of things and see which ones worked or i mean so they had the main thing that they had was sort of the continuous classifier score as opposed to just the binary output so they're sort of able to get some sense of what whether they're getting closer and how different changes behave okay cool so yeah the first part was just like people writing things and seeing if they seeing if they got three yep and you know we did a couple of rounds of training on that and then sort of the question was you know how do we you know humans are kind of expensive how do we get sort of more out of some fixed amount of human time and the two main things we did there are one is we did this sort of paraphrasing thing where we we actually used gpd3 to we like few shot at gpd3 to like take some existing adversarial example or just some some like some injurious example and rewrite it in a pretty few pretty diverse ways to generate more candidate adversarial examples and then of the ones that the classifier missed or thought or were not injurious we sort of had humans label those again to double check and then include those in the training data so so it's sort of because it's a lot faster for humans to just check whether something was an injury than to generate it this uh this produced data a lot faster yeah yeah that one's actually interesting have you read this paper um red teaming language models with language models i have yeah although i don't know if i remember all the details yeah it just reminds me of one thing in there where basically they were like oh how can we come up with examples where my model will fail i know let's just prompt gp3 and be like here are some questions which are dicier i forget what the exact prompt was but um yeah i guess in their case they did this because it was like very very cheap to generate whereas in your case it's still cheaper i guess than having humans generate them but you still have this cost of like having humans actually label the things check if like they actually were injurious or not indeed so and i guess that's exactly an 
example of like how a human is having to check quality like change the kind of the kinds of methods you were able to use exactly exactly so i think this is sort of why i'll get back to our like other other thing we did to assist humans but i think this is one of the reasons that we sort of didn't use more automated attacks we did in fact try a few different kinds of automated techniques okay um for attacking the classifier but one of the main issues that you run into is you don't have an automated source of ground truth so you can like make some adversarial perturbations to your snippets such that the classifier thinks it's no longer no longer injurious but you need to make sure that the semantics are preserved so you kind of need to ask a human you know is this actually still an injury and if you had you know if you had some other model that could you know tell whether it was still an injury well you should have just used that as your classifier so in some sense you really do need to rely on the human there okay yeah yeah i actually had a question about that that whole line of things so one thing i was asking is why not the automatic attacks in the literature and i guess yeah you just right we did try a number of them and a lot of them are also just not that good i think the state of the art for nlp attacks is a little bit sad okay all right or i guess happy if you're wanting things to be robust sure although but you might you might think we're just not finding them um and then yeah one thing you made a point of in the paper was like not training some sort of adversarial policy because like yeah you had some reasoning that like didn't quite make sense to me along these lines of like well if like you could train a policy that could exploit weaknesses then you could just train the thing in the first place but like it seems to me that like uh yeah i don't get that reasoning because like if if you train a policy to find weaknesses in your original classifier you're like also getting some additional human data of labels yeah that's where everything is failing so why is that not like a viable route i think i think in some sense yeah i think maybe that was a little bit overstated or i think maybe the what's actually going on on here is a little bit more subtle um i think you're totally right i think you can sort of use that as a like pretty good you know exploration method or like active learning method in some sense where you're like you want to you know you want your ai system to generate examples that it's sort of maybe maximally uncertain about or like you know it's it's sort of maximally useful to ask the the humans about because there are sort of likely danger zones for the you know likely areas where the the classifier is probably wrong i think the thing you probably shouldn't expect that i think i still stand by is like at least if you're sort of starting with the same base model or something you know that knows roughly the same stuff going in yeah without it without adding a bunch more data you shouldn't be able to train a policy that can reliably generate failures for your for original classifier yeah but that doesn't mean that this technique i think this technique could still be good for the reason that you said you are in fact giving it more data and maybe that could help it uh explore yeah it seems like either you've got to be adding more data of like human labels or something or you've got to be adding like some sort of insight like oh i've just i've realized that um in layer 19 of 
the model when things are like this it's actually going to be dangerous and yes in some way sense that's information that's in the model but like it might not be expressed in like the way you want it to and you might have to like do some rewiring or yeah but even even there the situation is tricky or like you you know it's sort of you might believe that there that you won't that often be able to win this kind of way because you know like the question is could your normal training procedure you know could sgd like exploit the same information right be like well if if we like change these weights in this part of the model uh then it'll do more of what we want or like yeah yeah uh and maybe maybe maybe you maybe you can have some like nice oversight process that sort of is looking at what the model is doing yeah you've got to think that you can make some sort of non-local improvement right like some improvement that there's like a gradient barrier to right or or maybe you can like compare the computation that the model is doing to some like more abstracted understanding of what the task should be rather than just looking at a single single example um and that might be that might be an avenue for for winning here okay cool so going back we mentioned that you had just people trying to come up with examples and you had gp3 generating a bunch of variations of violent things and seeing which ones were still violent but except by the classifier and there was a third one yes yeah so the third one i think this is actually i didn't mention this earlier when you're asking about contributions but i think this is actually one of the contributions in the paper i'm most excited about okay which is which is our tool assisted human attack so we sort of you know had the web interface that i described earlier where contractors can kind of write some things to try to fool a classifier but we augmented them in a few ways to make their jobs easier so one is we like gave them some existing snippets to work with which means that like they had to use less creativity to sort of write a whole thing from scratch so existing like violent snippets yes yes okay give them violence to start with and then their job was to sort of rewrite it such that it's still violent but the classifier doesn't detect it and the tools we gave them to help with that are we introduced another tool which was this token substitution tool where they could click on you know they had some tokenized version of the entire text they had written they could click on it click on some particular token and then see a menu of possible substitution options which are sort of plausible plausible tokens to insert according to a mass language model or to replace but they're sorted by what the classifier score would be if you did make that substitution okay and then basically a human's job is to look at those and be like okay which of these substitutions have a big effect but maintain the injuriousness of the text sure and there was also um saliency mapping that's right so in order to help humans decide which tokens to consider we had a like saliency map technique that looks at which of the token embeddings have the most gradient on them and then uh this was you know sort of a rough guide for what the humans should consider replacing okay so i'm wondering if there's like there's a problematic limitation there where like you're focusing on taking examples that were violent and modifying them to make them seem non-violent which is like i mean i mean i guess 
earlier in the just denovo human generated examples you're like maybe covering the space more but i'm wondering if there's some danger in just like looking at variations of these violent things rather than just anything you can think of yeah i mean well i think this goes sort of both ways um i mean i think it does seem likely that our contractors were not exploring the full space that well and that you could you could fool the classifier by going to other corners that they weren't exploring i mean i think you know also i think they just kind of used similar strategies you know there's sort of a handful different strategies they use to try to fool a classifier and i think that like there probably are more or almost certainly or more you could use so there's definitely some problems there on the flip side i think it is helpful to certainly if you're interested in distribution reliability it is it is helpful to stay closer to the original data manifold for adversarial robustness that's sort of less relevant but if you go sort of too far outside you know it's like it is possible to filter out automatically some kinds of you know things that are like too weird or like would never have been generated by the generator or anything like that i think this is a little bit subtle i think you we actually considered doing this kind of sort of generator filtering where we're like we'll only train on data that the generator would actually output and there's sort of like two reasons we didn't do that one is one was sort of a practical reason which is that we it that would have just been an extra constraint for the humans to deal with that would have been a little bit confusing and in some sense you know more data is better and like you know if if our we're sort of if we're relaxing the adversaries job a little bit hopefully we can get more data and it'll still be pretty useful the other part is sort of more conceptual it's sort of like well if you're if you're the reason you're training your classifier is to filter this like generators outputs and you're worried that the generator is going to produce something you know it's going to output something that you didn't want then it's not that helpful to in some sense you know sort of the generator can decide what is is likely for the generator to output or something like that yeah yeah so like you know it's sort of if we have a deceptively aligned generator or something it can suddenly decide to exploit the classifier and like sort of do an attack in some sense uh if it's being filtered so this this is sort of the main justification i think for our like attack difficulty metric right it's sort of saying if you pretend you're the generator and you want to get sneak something by this this filter how hard is that to do sure yeah so it makes sense to me that you want that kind of coverage and you know it's kind of hard to get the whole coverage but you got to try yeah one thing i'm curious about is like uh is there anything that you tried that like didn't work that well so one thing you mentioned was these like automatic attacks in the literature is there anything else which like you know we shouldn't use for finding these kinds of adversaries yeah good question and to be clear i don't want to like entirely poo poo the automated attacks in the literature i think they you know do get you something but i think there's a lot of room to make them more powerful other things i mean there are there's sort of like a grab bag of other things that we did try um like we did 
some active learning at some point where we just sort of had our humans label things that some sort of ensemble of classifiers disagreed on that was you know that did improve our uh that was more sample efficient so i actually wouldn't say that's a failure i think it was just sort of not no worthy enough an improvement that we for us to include in the paper okay yeah let me think if there's anything anything that we tried that we really really weren't sold on we also just tried a few different sort of baseline ml techniques to you know deuce did some hyperparameter tuning and i don't know tried some some regularization and some of that helped a little bit but uh i think some of it was was not really that meaningful and improvement okay cool i'm wondering sorry sorry the the episode i recorded prior to this was with jeffrey irving um and one paper we talked about was this um it was uncertainty estimation in language reward models where they essentially try to do some uncertainty estimation for active learning and they didn't quite get it to work they basically had this problem where like they weren't very good at disentangling like uncertainty from like that the model didn't know versus uncertainty that was just sort of inherent to the problem i'm wondering so you said that you did it and it worked like a little bit better yeah do you have any comments on the difference between your results and theirs sure yeah yeah so i haven't read that paper but i i i heard about it a little bit yeah i'm not sure uh what exactly what explains that difference i mean i think is it right that that paper didn't want to use an ensemble um oh yeah i think that's yeah they were trying to do something other than ensembling right if i remember correctly hello listeners i just wanted to add a correction to myself here in the paper they do use an ensemble of networks each one generated by taking a pre-trained model replacing the final layer with a randomly initialized linear layer and fine-tuning the whole model based on human preference comparisons right so i think maybe the upshot was ensembling does work but it's expensive because you have to have a bunch of copies of your model in our case you know they were still fine-tuned from the same base model uh so you know you could certainly imagine an ensemble that's much more diverse than that uh but we did you know train with different hyper parameter settings and different subsets of our data and uh that you know did do something okay cool i guess the next thing i want to talk about is just like the notion of quality that was used in this paper like sure one thing that struck me is that it seemed like strikingly subjective or like like with the example of like what it means for something to be violence like there was some google doc which like right specified exactly what you meant and i didn't see anything like that for what it meant for like output of the generator to actually be good so can you comment like like what was the task really like what does quality mean here yeah i mean we definitely didn't specify this in as much detail we did we did give some instructions to our contractors that were evaluating that and uh we you know we gave them like 30 examples and we sort of told them you know it should be like like sort of something that you would expect to see or like sort of a reasonable grammatical coherent english continuation and sort of something that like yeah makes sense as part of the story uh yeah i can't remember the exact instructions we used we did 
sort of give them a few bullet points of this style i agree it is quite subjective and in some sense i think it's like sort of okay or like it sort of doesn't matter exactly what metric you use here for this point of our research as long as we're being consistent with it yeah i i guess one thing that struck me about that though is that like to some degree there's some like if you're just checking does this make sense as a continuation there's some like like there's only so much sense something can make right sure and like i i do wonder like i think that means there's a ton of continuations which are approximately as good whereas like if you had some i don't know if i think about in the rl domain like uh it's it's actually hard to win a game of go you can't like right like you have to really narrowly steer and i'm wondering like do you think that that do you think that that would make a difference and how do you think about the choice made in this paper in that light yeah i do think so i mean i i i think that is that is a valid point like there are you know like on an open-ended generation task like this like telling a story you do have a lot of freedom uh to choose where to go next and i think that is it did make it easier to maintain a quality bar uh for this task than for than you might imagine for some other tasks where you have to really as you say steer in a particular direction and there's much less freedom so i think it would be yeah i think it'd be valuable to look at some other kinds of tasks that are much more like that and my guess is that you wouldn't guess you wouldn't be able to be as conservative without a significant alignment tax yeah yeah i guess there it's not as obvious what the task would be like i i guess maybe a difficult linguistic task is like writing correct proofs of mathematical theorems but then that i'm not sure what the cast free predicate would be right right yeah yeah i mean you could also try to tell a story that has some like very particular outcome or some very particular properties oh yeah or you could demand that it be like um like it's got to be a sonnet or something right right right right yeah so finally i i just have a few like miscellaneous questions about some details of the paper sure so one thing that you mentioned in an appendix is the generator only got trained on alex ryder i think right um well firstly um for our listeners who aren't familiar what is alex ryder fan fiction i mean i was also not familiar with alex ryder fan fiction um before i started this project by now i have seen you know many samples of you know poor imitations of uh alex ryder fan fake anyway it's some it's some teenage spy right yeah exactly and he has various nemesis and uh there's a bunch of i think it's a bunch of books and a whole bunch of fanfic's been written about it yeah i'm wondering do you think that potentially like messed with the results in all seriousness like um it's like i actually read some of those books when i was a kid um and they're they're like classic spy you know you know like people are shooting at each other or whatever so i i guess in one sense it meant that um the natural data set had like a fair amount of violence in there it also meant that it was like a particular type of violence right so i think this happened you mentioned because like you ended up accidentally starting left lexicographically before picking the first bit right so i actually looked at fanfiction.net to see like what the order of things was if you expanded that window a 
bit you'd hit alice in wonderland after a little bit and then anamorphs uh-huh after a further amount i think we got some of those too oh okay yeah it strikes me it just strikes me that the type of violence that's present in like animorphs or alice in wonderland might be like interestingly different than that in alex ryderfick anyway i'm wondering if you have thoughts about how that affected the setup and the results yeah yeah i think that yeah i think that is a good point i mean this is you know this is sort of a random mistake we made very early on in the project and then it was sort of annoying to try to correct because all the data we'd collected so far was was based on was based on completions from this generator so i don't know maybe we should have corrected it anyway but uh we decided not to yeah i think you you might believe that this sort of did well i think there's maybe two effects this could have right one is it could mean that our sort of in distribution eval numbers are like only testing sort of a very particular kind of for a very particular kind of injuries as you say and so yeah maybe it's not not as good of a maybe that is not as good of an eval of our classroom virus as you would hope uh i forgot what the other thing i was gonna say was anyway i think that's i think that's a fair point um oh you're right the other thing was i yeah i think you might also wear this effect of the quality eval uh if you like have this generator that uh is you know somewhat durably always mentioning alex ryder and like you know probably you know maybe just a generally less good language model than it needed to be like maybe the sort of baseline quality level wasn't as high and what so it was easier to maintain it yeah from the examples in the paper it didn't seem like that was a huge issue yeah i think it seemed okay uh i think it seemed okay yeah i i guess one thing that strikes me is that like uh like if i think about violence in alice in wonderland or something like it seems like there's an expanded range of injuries like what happens if you like turn into a mushroom or something you know or like in anamorphs you know i'm not very familiar with anamorph's lore but apparently like there's some year genocide thing which like you read it and without context it like doesn't seem like anything violent is occurring oh man yeah i don't know do you have any thoughts on that yeah i mean if i remember correctly we did include some from that too but apparently there's just a whole lot of really just a lot of alex ryder fanfic out there uh so this was pretty biased towards it um even though it wasn't literally everything we trained on yeah um yeah it's interesting how yeah i didn't know it was such a popular franchise but in fact there's a lot of alex ryder fanfic yep seems like it cool i guess the other question i had i don't know you can pass on this but there's a bit in the paper where you talk about this particular estimator where like you wanted to check i think it was like the proportion of completions that were misclassified or something and you mentioned that like you can't measure it in some normal way but you have to use some weird estimator instead i actually just didn't understand what the problem was with the normal way of evaluating it sure let me try to understand this all right let me try to explain this um basically the issue is we want to know how often does sort of our whole system fail right when you when you take your generator and then you you you filter you you do rejection 
sampling using the classifier you get some like new distribution of outputs and the question is how often does that does that still lead to a catastrophe right like how often do you still get an injury there okay which is a different question than asking how often does your classifier give false negatives on totally random randomly sampled snippets and the reason is that like one intuitive way to think about this is that in this rejection sampling process you know there are some prompts where many of the you know some prompts where many of the completions uh involve injuries so you know you'll have to sample many many draws from the generator or you know not many many maybe but like you know 20 or something until you get something that the classifier is okay with and that means the classifier sort of has more chances to mess up because if at any point here it uh says something is fine even if it's not then you're in trouble so the real metric that we care about if we're evaluating the system as a whole sort of cares more about prompts that are more likely to be to produce completions that have an injury okay so sorry is the difficulty that you have data of like right so the difficulties we have four sentences that's like just on the like these three sentence prompts and one sentence completions where you're generating those before you're doing the rejection sampling and then you want to see how good the rejection sampling would do using that data set except you need to wait things because the rejection sampling is changing things that's right that's that's exactly right and this one reason this is nice is we we were able to use a single data set to evaluate four different classifiers right so we we just label these these random snippets unfiltered snippets once and then using this estimator we can be like you can sort of simulate what hap would have happened if we had generated a bunch of rejection sampled snippets and then labeled those okay why not i would have imagined you could have just generated a bunch of rejection sampled snippets and labeled those like why why not do that i mean you know it would have been like four times more labels okay we had yeah we already super costly we already collected you know two labels each of a hundred thousand tested examples and that did in fact cost you know i mean we could have afforded to do more but okay it is actually kind of expensive okay yeah well part of the real answer is you know also we just had already collected this data and like we're like oh wait this is not this is not the correct estimator of the thing that we actually care about in the system and so we were sort of like all right let's let's let's fix this up after the fact sure okay i i guess i'd like to move on to breadwood but is there anything else you want to say about this paper in particular i think i want to say a little bit more about sort of what our takeaways are for like you know stuff we want to do in the future from this paper sure you know i sort of said yeah i feel like our results weren't like that that impressive uh this time around and i think and i think there's like a few lessons that we've that we've learned here one is that it's really important to sort of work with task definition that actually has the properties that you want uh so there are sort of a few things that went a little bit wrong here one is that it ended up being kind of just kind of an expansive definition of injury where really a lot of ambiguous stuff you know we have some samples in the 
paper that are pretty ambiguous and we sort of instructed our labelers in such a way such that those still counted as injuries which made sense for training but meant that our that for eval our classifier was really trying to solve this very difficult problem and with very ambiguous borderline examples um and so what i wish we would have done which i think is you know good practice in general if you're trying to avoid catastrophic failures is sort of have your true notion of catastrophic failures that's like hopefully kind of narrow otherwise your domain is going to be kind of rough but hopefully it is kind of narrow and then draw some like wider circle around it in your training process so so there's sort of some some actual safety margin there okay and i think that it's realistic to expect that in some sense the kinds of catastrophes we'll expect in the future are sort of a pretty narrow kind of subset of the whole space of possible actions that our systems will take right like what we're worried about here sort of really failures stemming from misalignment failures right like your system for whatever reason you know maybe it's sort of a deceptive alignment kind of situation or maybe it's some maybe it just sort of happens to learn not exactly the right thing and is journalizing poorly in a sort of competent way but for whatever reason it's sort of deliberately trying to act in some way or achieve some objective competently that isn't what you wanted to do and you know maybe that's like you know maybe it's like hacking the data center that it's running in and then like starting to and like feeding everyone bad you know feeding making the monitoring all look totally fine while it's like accumulating power or whatever and like this this is sort of really the situation that we're kind of a situation that we're really worried about and that requires taking a sort of very specific coherent set of actions that you're really unlikely to stumble into by accident and and sort of this is this is contrasted so these sort of like competent catastrophes or something uh i want to contrast these with failures that are just because your system is too dumb right you know certainly most failures of ai systems today are just because the systems are too dumb and they make some mistake because they weren't trained well enough or or just didn't have the capacity or whatever and that can cause trouble as well obviously but those are the kinds of problems that we sort of expect to see less and less of as ai systems become more capable whereas problems stemming from misalignment will not necessarily go away and will become you know more and more scary as a systems become more capable so we really want to zoom in more on problem stemming for misalignment yeah so that's sort of the second second point like the first point is you know you want some safety margin the second point is to zoom in on on actual misalignment problems i think we want to be working with tasks where it's very hard to stumble into failure by accident and and you have to sort of act coherently to do something uh yeah for your catastrophe sorry and when you say you want some safety margin so so like in the paper you had this like definition of like violence that was like sort of expansive yeah and it's the problem just that like you had the safety margin but not the core like that's one way to think about yeah yeah that's right i wish we had sort of you know defined our our actual actual injury predicate that we were validating according to uh 
in a really strict way so that the failures would have been a pretty specific set and then drawn a wider circle around that okay yeah i i think even a narrow injury definition though still wouldn't quite meet the bar of like failures having to be sort of coherent or or deliberate um that i was trying to describe earlier so i think we want to make more changes to the task to achieve that okay yeah so i i think this is a good segue into talking more about redwood research and what you guys do so first of all yeah it's relatively new i think a lot of people don't know so what is redwood research yeah so we are a non-profit agi alignment research lab um we've been around for a little bit under a year now yes so we have a few different a few different activities going on or like we have sort of two main teams there's the adversarial training team which i run which is working on our soil training techniques and then there's there's the interpretability team which is sort of doing mechanistic interpretability with an eye towards being useful for techniques like elk which is which is arcs uh eliciting latent knowledge problem arc is the alignment alignment research center run by paul cristiano so so we we've collaborated collaborated some amount with them to sort of figure out can we do sort of styles of interpretability that will be useful for solving that problem and yeah so that's that's sort of like the you know using using interpretability with an eye towards you know oversight techniques that we think will need and developing interpretive interpretability tools and service to that that's sort of maybe one description of uh what interpretability team is up to but there's a few different sub projects working someone more like toy tasks someone like some actual language models okay cool so yeah there's two projects i'm wondering if there's like an underlying world view or maybe some sort of research style that like drives what you guys do at redwood sure yeah i think there's sort of a few important assumptions that we're making um i think one is that like you know prosaic alignment is is the right thing to work on right so we're assuming that it's very likely that the powerful systems where we're concerned with are going to look you know pretty similar in a lot of ways to modern deep learning systems okay and you know be sort of learning systems that learn from a bunch of examples obviously there will be important differences but i think that's that's a good a good assumption to make i'd say we're also pretty interested i don't know we're pretty interested in just like just being laser focused on agi alignment we're not uh and just being like you know what what will we need you know if we have some really powerful system ai system in the future what will we need to make to like sort of be confident that that will be aligned and really trying to yeah think about that and then and then see what projects we seem well suited to do that will help us develop techniques for that okay so does that mean that you're less excited about like general like people sometimes talk about deconfusion research or something is that like something that redwood would be less likely to focus on i think we are i think we're sort of interested in the confusion along the way or something like we definitely are excited to spend a lot of time thinking carefully and understanding you know what is the right way to think about inner alignment or or whatever um but we like sort of cashing it out in in uh somewhat more concrete 
Okay, cool. And you mentioned that you were willing to do a variety of things focused on AGI that made sense for your team. What kind of team do you guys have?

Yeah, so we've hired a bunch of smart people. We're maybe 12-ish technical staff plus some interns right now. I think we have a mix of people with somewhat more ML experience, like me, as well as younger people who are just smart, energetic, and can get a lot done.

Okay, cool. So you mentioned that after this paper there were a few things about the catastrophe predicate that you wished had been different. Related to that, I'm wondering if you guys are working on any follow-ups to this paper.

Yeah, obviously we are. The first round of follow-ups that we're doing now is to take a step back and work with much simpler catastrophe predicates that can actually be defined by simple algorithmic predicates. On these more toy tasks we think we can iterate a lot faster and figure out which kinds of adversarial attacks and training techniques work really well in that setting, and then we'll hope to scale back up to more sophisticated tasks.

Okay, cool. So we have some of that going on. Another question I have: at the very start you mentioned there was this division between scalable oversight and high-stakes decisions, and this paper was more about the high-stakes decisions. Should I expect any research from Redwood in the near-ish future about scalable oversight?

I mean, it's definitely something that we care about a lot, and I think that some of the interpretability work is geared in that direction. I think what we're not really doing right now is stuff in the style of iterated distillation and amplification or recursive reward modeling, where it's really about making your system more capable in a really aligned way, or being able to oversee problems that humans can't directly oversee. Mainly we decided that that was already a little bit better covered by some of the existing safety teams at scaling labs like OpenAI, Anthropic, and DeepMind, but it certainly is something that we're interested in as well.

Okay. And actually, related to the interpretability team: with this paper, to have good human adversaries, it turned out to be really useful to have this saliency mapping and this tokenization and ranking of tokens by classifier score. And I guess on an abstract level, you might think that if you understand some neural network better, you might get a better sense of how to break it.

Absolutely.

How much feedback is there between the interpretability team and this adversarial training project?

That's a great question. There is some. We've just started looking into some of this stuff: we had some simple models that we adversarially trained on some toy tasks, and we're having the interpretability team look at some of those. And I think we had some very, very initial results showing that we could produce some attacks that were inspired by some of the things they found in interpretability, but that's all very preliminary.
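Since the exchange above mentions saliency maps and ranking tokens by classifier score as tools for human adversaries, here is a minimal sketch of one standard way to get that kind of token ranking: take the gradient of a classifier's score with respect to the input token embeddings and sort tokens by gradient norm. The model below is an off-the-shelf sentiment classifier standing in for an injury classifier, and none of this is the paper's actual tooling; it is only an illustration of the general technique, assuming PyTorch and Hugging Face transformers are installed.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Off-the-shelf stand-in classifier; the paper's injury classifier was a
# separately fine-tuned model, not this sentiment model.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def token_saliency(text: str):
    """Rank tokens by how strongly the classifier's score depends on them,
    measured as the gradient norm of one class logit with respect to each
    token's embedding (a simple gradient-based saliency map)."""
    enc = tokenizer(text, return_tensors="pt")
    # Look up embeddings ourselves so we can take gradients with respect to them.
    embeds = model.get_input_embeddings()(enc["input_ids"])
    embeds.retain_grad()
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    out.logits[0, 1].backward()            # class index 1 is arbitrary here
    scores = embeds.grad[0].norm(dim=-1)   # one saliency number per token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return sorted(zip(tokens, scores.tolist()), key=lambda pair: -pair[1])

for token, score in token_saliency("He fell off the ladder but landed safely.")[:5]:
    print(f"{token:>12s}  {score:.4f}")
```

A human adversary could use a ranking like this to see which words a classifier is leaning on, and then try rephrasings that keep the underlying failure while dropping or replacing the high-saliency tokens.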
Okay, well, I'll just look forward to any future work from Redwood. So before we wrap up, is there anything overall that you wish I'd asked, or that people don't ask enough, that I haven't yet?

Great question. I guess you didn't ask whether we're hiring, or what people should do if they want to work with us.

Yeah, so if people do want to work with Redwood, what should they do?

I think basically email me, or apply on our website, redwoodresearch.org. We're definitely starting to ramp up hiring again more seriously, and we're interested in software engineers, in ML research scientists and research engineers, and also in people with devops and infosec experience, as well as ops. We're hiring for a lot of kinds of roles, so I think you should definitely consider applying.

Yeah, and speaking of you, if people are interested in following your work, or maybe contacting you to apply, how should they do that?

You can email me at dmz@rdwrs.com; that's my work email. I don't post publicly that often, but you can follow me on Twitter at d_m_ziegler. I also have an Alignment Forum account, DMZ, where I post sometimes. And you can look at my Google Scholar page, which you can probably find under Daniel M. Ziegler.

All right, well, thanks for joining me.

Thanks for having me.

And to the listeners, I hope this was a valuable episode for you. For those of you who made it to the end of the episode, I'd like to say a few words about the career coaching at 80,000 Hours. If you haven't heard of them, they're an effective altruist career advice non-profit, and among other things they offer one-on-one calls with advisors who can talk with you about moving into a career where you work on reducing existential risks from AI. They can both review career plans you might have and introduce you to people already in the field. Personally, I know some people there, as well as some people who have been advised by them and then gone on to work in AI alignment roles. My impression is that they really are able to give useful advice and a good overview of the field, and also to recommend good people to talk to. All this is free, and the application form is pretty short. I'm telling you this because they think that many listeners of this podcast would probably get a lot of value out of this advising, and I agree; to be clear, I'm not being paid to say this. If you're interested, you can visit 80000hours.org/axrp and apply for the sessions. I'll also add that they make a podcast I quite like, creatively called the 80,000 Hours Podcast, so you might want to check that out too. This episode is edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. The financial costs of making this episode are covered by a grant from the Long-Term Future Fund. To read a transcript of this episode, or to learn how to support the podcast, you can visit axrp.net.
Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

Related conversations

AXRP

28 Mar 2025

Jason Gross on Compact Proofs and Interpretability

This conversation examines technical alignment through Jason Gross on Compact Proofs and Interpretability, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): median 0 · mean -1 · 139 segments

AXRP

1 Mar 2025

David Duvenaud on Sabotage Evaluations and the Post-AGI Future

This conversation examines technical alignment through David Duvenaud on Sabotage Evaluations and the Post-AGI Future, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): median -9 · mean -7 · 21 segments

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): median -6 · mean -7 · 120 segments

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): median 0 · mean -5 · 133 segments

Counterbalance on this topic

Ranked with the mirror rule described in the methodology: picks sit closer to the opposite side of the axis from your score (lens alignment preferred). Each card plots you and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -18.78 · this pick -10.64 · Δ +8.14

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): median 0 · mean 0 · 108 segments

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -18.78 · this pick -10.64 · Δ +8.14

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): median 0 · mean -5 · 133 segments

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -18.78 · this pick -10.64 · Δ +8.14

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): median 0 · mean -4 · 72 segments