AXRP · Civilisational risk and strategy

Concept Extrapolation with Stuart Armstrong

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core safety questions through Stuart Armstrong's work on concept extrapolation, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End
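
For readers who want to see how a bar tint like this could be computed, here is a minimal Python sketch. The score range of -100 to +100, the RGB values, and the two-segment linear interpolation are illustrative assumptions, not the site's actual palette or scale.

```python
def lerp(a, b, t):
    """Linear interpolation between two RGB tuples."""
    return tuple(round(a[i] + (b[i] - a[i]) * t) for i in range(3))

# Assumed palette: amber at the risk-forward end, cyan at the mixed
# midpoint, white at the opportunity-forward end (illustrative values only).
AMBER = (255, 191, 0)
CYAN = (0, 255, 255)
WHITE = (255, 255, 255)

def tint(score, lo=-100, hi=100):
    """Map a slice score onto the amber -> cyan -> white strip.

    Assumes scores span [lo, hi]; negative = risk-forward,
    0 = mixed midpoint, positive = opportunity-forward.
    """
    t = (max(lo, min(hi, score)) - lo) / (hi - lo)  # normalise to [0, 1]
    if t < 0.5:
        return lerp(AMBER, CYAN, t / 0.5)            # amber -> cyan half
    return lerp(CYAN, WHITE, (t - 0.5) / 0.5)        # cyan -> white half

print(tint(-100), tint(0), tint(60))  # amber, cyan, pale cyan-white
```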

Across 63 full-transcript segments: median 0 · mean -2 · spread -200 (p10–p90 -100) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.

Slice bands
63 slices · p10–p90 -100

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 63 sequential slices (median slice 0).
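
A minimal sketch of how a slice summary like the one above could be recomputed from raw scores. The example scores and the band cut-offs (below -50 counted as risk-forward, above +50 as opportunity-forward) are assumptions for illustration; the actual thresholds behind these figures are not documented on this page.

```python
import statistics

def summarise(scores, risk_cutoff=-50, opp_cutoff=50):
    """Summarise per-slice perspective scores (hypothetical thresholds)."""
    ordered = sorted(scores)
    n = len(ordered)
    # Simple nearest-rank percentiles, good enough for a sketch.
    p10 = ordered[int(0.1 * (n - 1))]
    p90 = ordered[int(0.9 * (n - 1))]
    return {
        "slices": n,
        "median": statistics.median(ordered),
        "mean": round(statistics.fmean(ordered), 1),
        "p10_p90": (p10, p90),
        "risk_forward_pct": round(100 * sum(s < risk_cutoff for s in ordered) / n),
        "mixed_pct": round(100 * sum(risk_cutoff <= s <= opp_cutoff for s in ordered) / n),
        "opportunity_pct": round(100 * sum(s > opp_cutoff for s in ordered) / n),
    }

# Example with made-up scores (not this episode's actual slice data):
print(summarise([-60, -10, -5, 0, 0, 3, 8, 12, 20, 25]))
```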

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video P8EgpMfej5s · stored Apr 2, 2026 · 2,876 caption segments

Captions are an imperfect primary source: they can mis-hear names and technical terms. Use them alongside the audio and the publisher's materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/concept-extrapolation-with-stuart-armstrong.json when you have a listen-based summary.
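
If it helps to bootstrap that file, here is a minimal sketch. Every field name in the skeleton is hypothetical; check the schema used by existing assessment files before committing one.

```python
import json
from pathlib import Path

# Hypothetical skeleton: field names are illustrative, not a documented schema.
assessment = {
    "slug": "concept-extrapolation-with-stuart-armstrong",
    "listened": False,
    "summary": "",      # listen-based summary goes here
    "key_claims": [],   # claims worth verifying against the audio
    "notes": "",
}

path = Path("content/resources/transcript-assessments/"
            "concept-extrapolation-with-stuart-armstrong.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(assessment, indent=2) + "\n")
```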

[Music] hello everybody in this episode i'll be speaking with stuart armstrong stewart was previously a senior researcher at the future of humanity institute at oxford where he worked on ai safety and x-risk as well as how to spread between galaxies by disassembling the planet mercury he is currently the head boffin at aligned ai where he works on concept extrapolation the subject of our discussion for links to what we're discussing you can check the description of this episode and you can read the transcripts at axrp.net well stuart welcome to the show thank you cool good to be on yeah it's nice to have you so i guess the thing i want to be talking about today is your work on or your thoughts on concept extrapolation and model splintering which i guess you've called it can you just tell us what is concept extrapolation model splintering is when the features or the concepts on which you built uh your goals or your raw functions break down traditional examples are in physics when instance e the ether disappeared it didn't mean when the ether disappeared that all the previous physics that had been based on ether suddenly became completely wrong you had to extend the old results into uh a new framework you had to find a new framework and you to extend it in that way so model splintering is when the model falls apart or the features of concept fall apart and concept extrapolation is what you do to extend the concept across that divide like there was a concept of energy before relativity and there's a concept of energy after relativity they're not exactly the same thing but they uh there's a definite continuity to it cool and can you give us an example of so you mentioned like uh you know at some point we used to think there was an ether and now i think there isn't what's an example of a concept or something that's splintered when like we realized there wasn't an ether anymore just to get a really concrete example um maxwell's equations maxwell's non-relativistic equations are based on a non-constant speed of light or maxwell's equations are not relativistic though they have a relativistic formulation and i thought i thought they were i'm isn't that why you get the constant speed of light out of them okay if that's uncertain then uh let's try uh another example um we could do um energy once you enter general relativity or energy uh inertial mass for examples those concepts need a new toenail needed a newtonian universe uh to make sense and when it wasn't so much the absence of ether that it was the surprisingly constant speed of light that broke those so when you measure the speed of light to be the same no matter what your own movement and acceleration is this breaks a lot of the newtonian concepts okay and is the idea that there were like potentially many ways you could have um generalized them or reformed them to work in a relativistic universe uh no not really okay there this is part of the interesting thing about concept extrapolation is that sometimes there's only one real inheritor of the old concept and sometimes there can be multiple ones like temperature our modern concept of temperature as the derivative of energy with respect to entropy for example is one of the direct sentence of well it feels hot today or versus it feels cold today but um our neuroscience qualia how we feel things is another direct descendant of that old concept okay physics might lead us a bit astray because extrapolations tend to be clear and definite and only one though maybe that might be because of the choice 
that we've made there might have been multiple ones that were discarded along the way but in general it's sometimes there's a clear extrapolation sometimes there are multiple routes you can go down and sometimes there's a mix between the two okay and should i think of concept extrapolation like is this the kind of thing that you do when like you learn something new about like the way the world works or like like am i also going to talk about it in the case where i just like you know go to a new place that i haven't been before and like things are slightly different there but like it's not like the laws of different it's like the laws of physics are fundamentally different or anything all of these are on a continuum um if you look at how we've tried to define model splintering there's the smallest of of changes or can be captured in the formalism or the you you basically have ontology changes at one end and you have is a car that is painted a new color uh still a car at the other end okay and i guess the idea is that you're gonna try to think about all of these with like one framework yes okay and the car that is painted of a slightly different color is a an example of one that there's a trivial single extrapolation yes it's a car it's uh where uh and the ontology changes or extreme ontology changes can be ones where it's really not clear what to do okay so now that i think we've got a good sense of what concept extrapolation is how do you see it as relating to ai alignment or like preventing doom from ai in a sense i see it as the the the very center of it uh nate has recently published something about the left turn or the unexpected left turn i believe he's formulated it the idea being that at some point it'll become much easier to generalize capabilities than to generalize uh morals uh not morals um a lot values goals those kind of objects and the one way of thinking of capability increases is this is a changing and presumably an improvement of your world model so you get a better ability to influence the world as your model of it changes as you understand different approaches and that sort of thing then if you just change the models there and have your naively defined goals will have your goals naively defined in the way they were originally defined then this is going to go hideously wrong but if the goals change as the world model changes then you're not naturally artificially solving part of the alignment problem an example that i had was the a sort of toy model of wireheading where you had images of smiling people non-smiling people and the smiling people had a big happy written on the image and the non-smiling ones had sat written on the image and when the ai had the ability to change the text it was its reward was initially defined on the on the images which were given the label of happy or sad and then it developed the ability to change the text on images that it was producing in whatever way and this led to non-smiling images with happy written on them and the opposite so this is model splintering it has now encountered a bigger its model of the world previously was that the expression and the text were perfectly correlated so there's no real need to distinguish between the two now its model of the world has changed and it can see that these features are no longer correlated they can be different and we've published a benchmark recently and what the benchmark is essentially is upon real life realizing that the model has splintered that the correlated features are no longer 
correlated it generates multiple candidates for what its reward information could extrapolate to okay so in a sense it has gained capabilities or or knowledge about the world either way either way is a good way of modeling what's happened and now as a consequence it must extrapolate its reward or generate different candidates to deal with its increase in power or its increase in its world model okay so i guess the idea is that presumably we'd have to be able to get two models out of that because it's like not obvious whether we want to be categorizing things based on the smiley faces or based on the red text at the bottom of the screen right yes the fact that it's only two models is a human judgment of what features are important okay and there's actually a rather interesting philosophical computer science insight that has emerged from looking at this this is can i try a very simple example yep hit us you may ha this is the a double m nist one of the simplest possible classification tasks imaginable you have a zero on top of another zero that is labeled zero okay and you have a one on top of another one and that is labeled one and humans instinctively see that there are two features here the digit on top and the digit on the bottom however when you zoom into these images there's actually hundreds of features there's the upper left curve of the zero the upper right curve of the zero every piece of every digit is its own feature that can be used quite successfully as the feature on which you do the classification task so intrinsically this data has many many different features and many many different extrapolations but the ones that you consider or that it is useful to consider depend on how the world model of the ai changes or and on the data that it sees if it sees 0 1 and 1 0 if it sees a lot of those then the two useful features are okay it is the top digit versus bottom digit that is as expected but if it sees things where you get the left half of the top digit and the right half and those are changed and that if that kind of thing happens then there you would hone in on different features so what this means is that the world model or the unlabeled data in this case play the role of telling you which are the possible extrapolations what what breaks what what are the features that break apart and what are the features that don't break apart okay and again this scene yeah so feel free to cut that whole thing out if it's too technical no no no um and i just wanted to say that we have been starting with we i just wanted to say that we started with image classification not because image classification is particularly relevant to this task but because we had got to the point where the theory seemed sufficiently done that we needed to get some practical experience and practical results on what extrapolation is and how it can work to feed back into the theory ultimately so that uh bit that i told you about how the unlabeled data or the world model can determine which are the features that come apart which are the ones that don't this is an insight that is constructed from the practical experience of doing uh the image classification task uh we're aiming to extend it to rl and other environments uh at the moment as well but yes the practical and theoretical benefits of sitting down to do this seem to have been borne out at least are starting to bear fruit okay and can you give us a sense of like what is this the theory that's been developed here so first of all like what what does the theory 
like tell you it tells you how to accomplish the task of extrapolating beyond the training data as your world model changes and how this might be done in a human safe way okay it's in essence the core of the alignment problem in our view is always going to be some version of concept extrapolation in that what we are trying to do is do a safe survivable flourishing world via ai and none of those concepts are defined with any degree of rigor across the potential futures that we encounter so how these can be extrapolated is going to be a critical part of it we if we want flourishing we have to define what flourishing is and get that definition to extend at least adequately across all possible weird futures that the ai could put us in so there is in a sense in a sense there's no theory that says that this is doable and there's no theory that says that it has to fail what we are as humans is we exist in a variety of environments uh both real and imagined and we have pieces of values and preferences defined across these environments and it is not surprising that say there's huge contradictions between the different parts of our values when we try to formalize them or extend them but you can extend a lot of human values in relatively decent ways more or less complicated depending on your preferences so it's not like there is a true essence of human value out there that we are trying to get to if there was then in a sense concept extrapolation has to succeed or some alignment method has to succeed in theory it's just a question of reaching that goal so it may fail in practice but since there isn't that we can't say that alignment has to work in theory but similarly we can't say that alignment has to fail in theory it's because yes you might make the argument if there is this ideal thing out there that if we can't reach it then it has to fail but since there isn't that it doesn't have to fail and we've seen examples of concept extrapolation in physics and morality that can be more or less good but are tend to be non-disastrous one of my sort of favorite examples of that is how you can go from the sentiments expressed when it was written we hold it self-evident that all men are created equal and get to the suffragettes movements and votes for women it seems to be a slight contradiction or huge contradiction between what was written what was intended and the ultimate goal but there is a relatively straight line that you can follow through history and morality that goes from one to the other so that seems to be a successful extrapolation of a concept of a moral value to a new environment so it can be done in practice and it can fail in practice we also have examples of that so this seems to mean that we need a lot more experience as to how these things succeed and fail in practice okay so yeah so i guess part part of what you're saying there is just like uh related to the fact that to some degree this is inevitable and like the the thing you're aligning to i guess is also concepts that don't obviously that don't have a super clear extrapolation i i guess i'm wondering like you mentioned some idea of a theory of concept extrapolation like like what are the relevant objects in this theory and like what's yeah can you give us a sense for like what the what the results are in this theory or like you know what the relevant properties are of things such that if you have this property then you can extrapolate well or poorly or something let me give you a example it is my strong belief that symbol grounding 
can be solved in part by concept extrapolation an example that i give for this is imagine if you had an ai trained on videos of happy people and maybe the negative videos of sad people with a lower reward now the standard alignment failure mode here is that it fills the universe with videos of happy people the reason i am saying that symbol grounding may have a may be solvable by concept extrapolation is that even though the wire-headed ideal scene is the best explanation for what's going on for the training data the wire heading is always the best explanation for the training data because it fits perfectly the hypothesis of well there are actual humans out in the world they are sometimes sad and sometimes happy that correspond to these clusters of something or other reactions hormones this is not a theory that is all that complicated to consider if you have a good world model so just having a practical model of the world would suggest that the symbol grounded version of these are not too hard to find they're below the wire-headed one in say probability or simplicity or fit but they're not it's not random fitting this data to actual humans being happy is a lot easier than fitting this data to stock prices or the movements of the moons of jupiter and things of that nature so this is and i'm say this is my strong belief at the moment i have not proven this result okay but this is the kind of thing that would be very valuable to have a a formulation of a uh results both theoretical and experimental and it would make the alignment problem harder or easier depending on what uh the result is sure because in one view at least if we don't change the world too much you have the wire heading solution and you can actually rule that out relatively easily and the next strongest candidate is something that corresponds pretty much to what we want and this would allow us to define humans say humans and humans happiness by here's some humans this is what happiness looks like rather than constructing very very complicated definitions of what these are there's essentially a trade-off that the easier concept extrapolation the if concept extrapolation works fantastically to a level that i don't expect that it will you can kind of get good outcomes by just pointing vaguely at good outcomes in the world now however it if it doesn't work quite as well as that you need to put more training data into it but in any case the ultimate aim is that you do it without having perfect training data because you will never have perfect training data okay so so i guess that's the aim and there's some sense that like you can probably prove things based on like like based on which theories of the world like make more sense or easier to handle or something i guess part of what i'm wondering is like is there are there existing results or theoretical constructs yet or is this like a work in progress um there are theoretical constructs as in my first attempt at the mathematical formulation of model splintering which i imported into category theory which was quite interesting mathematically now the problem with it is that it is sufficiently universal that it is relatively easy to generate toy examples where concept extrapolation succeeds and ones where it fails the failure mainly being there are too many options and the consequences of bad decisions are too great that it just cannot be um solved but how which of these that and the these the easy ones are things closer to physics or um this is a car of a different color kind of 
approaches where the basically if you want to do a model of the world where concept extrapolation is easy defined fundamental reality with these concepts and then add approximations to it and then the extrapolation will be easy because the extrapolation is already there in a sense okay and it turns out that in physics the extrapolations are already there in most cases general relativity and energy being a potential exception okay but um so which what is the description of our moral our value concepts our preferences because they're not they're clearly not pieces of a single well-defined full human utility function so they it hasn't been built that way but but they are of a maximal complexity the brain take the complexity of the brain take the complexity of all human brains the estimates of every human being on every potential realistic uh short environment that they could encounter that doesn't break their brain the the amount of information that is finite and can that be extended in a safe way to extreme capabilities and extreme new environments i think it can but how easy is it how hard it is this is far more an experimental question than it is a theoretical one i i feel okay so would a summary of that be like there's some theoretical there's some formalism but like it's not it's not expressive enough to there are like the real gains are in just like trying things and seeing what happens um or like checking how how reality actually is it's testing how reality actually is it's also testing how the tools that we use might interact with that reality there are ways like when you start extrapolating there are choices that you can make you can you can rely heavily on human feedback you can try and idealize human feedback along the lines of what humans themselves would consider idealization you can rule out wireheading solutions and try and give a syntactic definition of what wireheading consists of you can use the sort of modal solution at each point uh in a sense so that's more sort of self-directed and simplicity based you can try sort of low-level human values and extrapolate them and then try and fit them within say human meta preferences or you can try the meta preferences first and extend the low level preferences to meet them up i'm mentioning all these because we don't really know which ones will be better okay and some of them may break immediately some of them may lead to theorems as to the conditions under which they break or don't break okay and that would be very useful to have so in that case so you mentioned some kind of formalism that um was maybe connected to category theory where you could come up with these kind of examples and counter examples can you give us a flavor of like what that is so that we can get a sense of concretely like what's going on okay bear in mind that i am not sure that this formalism is the ideal one okay it may be too general it has some universality properties but universality properties and maths are cheap and easy but okay the basic idea is that you start with your features and you have probability distributions over the possible values that the features can take and this defines your world model okay and for example you could have the feature of pressure and temperature and volume and the laws of um the gas the ideal gas laws would be would give you a distribution of how these are related okay then you might move to a different model of say atoms bouncing around and then you can maybe in in this case you can at least statistically port the ideal 
gas laws um if you cross but maybe actually there are different types of atoms in your new model and that means that the ideal gas laws are no longer accurate you have the uh say the van der vaal gas laws so you relate this is this this is the my description of the universe according to this features this is my probability distribution and this is my translation of this set of features to this set of features and then there is a comparison between what the probability distributions are saying okay so if you just have one type of atom you can take you can get the ideal gas laws as a statistical uh version of the deterministic atoms bouncing around model if you have different types of atoms then the two distributions may or may not correspond exactly depending on how the different types of atoms where they're located and how how you analyze them i think i'm i'm rambling there let's uh take that so when you translate between ideal gas laws and atoms bouncing around you can check whether your say your more advanced one the atoms bouncing around does the probability distribution there give you exactly the the probability distribution from the gas laws the ideal gas laws in that case this is just a pure improvement you've refined your understanding of what's going on but what will generally happen is that it's slightly different you've not only refined your understanding but you found errors in your previous understanding and the category theory comes in when you move from one uh universe of potential features to another universe of potential features okay and you can port the probability distributions back and forth and then you can compare what they look like on the other universe and if they correspond perfectly that means you've just you've done a pure refinement uh you've g you've you're either gaining knowledge in one direction or you're statistically grouping them together in the other direction and that forms a category but there are weaker correspondences where you can learn or approximate or the things don't quite match up that say correspond more to how you went from newtonian physics to relativity so but there is a you can have a distance metric for how well they correspond or how well they correspond in typical environments so this is a formalism with which you can discuss how model splintering works i am not yet sure whether it is a useful one okay it's i did it so that i knew that there was something a theoretical construct out there that everything could be formulated in terms of if i needed to that i wasn't i wasn't sort of going in an area where the the objects fundamentally didn't speak to each other or were of a completely different nature that so i i now i don't have to fear any ontology change because any ontology change including some of the weirdest ones you can do can be formulated in this in this formalism but as i say whether it is useful uh except as a thinking tool i i don't know okay this is going to be something that will be determined more experimentally okay so when we're thinking about concept extrapolation yeah do you have a sense of what it looks like when there's like only one way for a concept to extrapolate like is there something you can say about what has to be true for there to be essentially a unique or maybe just a preferred extrapolation the the easiest ways as i said to get that is to start with your ultimate concept and have the the lower level or the messy ones as just approximations of that ideal one and that that that is physics uh basically 
where um it turns out that the underlying model of reality is surprisingly simple it could have not turned out that way in which case the extrapolation across concepts would have been more complicated i mean we could go back to the ancient greeks and the conception of the four elements or five elements and how they determined the physics of the time that there is no real extrapolation of what is air and fire from that era to nowadays it's all sort of air is something that goes up and is moist or is it dry it's it doesn't really make much sense but the predictions you could make from that that if you tossed a rock into a pond it would fall that that extrapolated yeah and i still talk about like air and fire you know it's not like those concepts have gone away i think we would agree about for the most part about what stuff i run into in my everyday is air and what stuff is water maybe i mean what about earth and if you are in another planet is that air i mean or i'm not there yeah you know okay but yeah but we're having a discussion here about how these concepts they kind of yeah they do they do in some circumstances they do not don't do in others and yes so the challenge here is that we absolutely want the concepts to extrapolate well because they're we're using them as not just concepts but as our goals so we want it to do well and what is the definition of well which itself is a judgment based on our values which again it has to extrapolate itself so let's be a bit more practical human rights for example have turned out to be easily for the fundamental human rights have turned out to be relatively easily extendable or extrapolatable in the modern world because there is a relatively clear division between what's human and what's not where you get into ambiguous areas like um fetuses and embryos that is where you get some of the biggest fighting on them but if yeah if they were sentience chimpanzee well chimpanzee will probably send if they were increased intelligence chimpanzees that could cope in modern society say or had good language and there was a large community of them and that kind of thing not not human definitely distinct from human possibly stupider in many ways if you wanted to ensure that they were different but their existence would mean the definition of what is and wasn't isn't a human becomes a lot more complicated if no that's a bad example but a better example is what if the neanderthals were still around in large numbers that you had a differently intelligent humanoid species probably stupider in most ways would be my guess but maybe smarter in some ways and different and there the concepts of human rights would become more complicated because maybe there are some things that we value that they don't value at all our vice versa especially maybe in the social arenas so a lot of the ease of extending the concept of human rights or the concept of legal equality has been practical that it turns out that this makes sense it can be defined and then you might think well here are examples neanderthals intelligent chimpanzees that undermine it maybe we can either draw the boundary large or we can say let's not create or not allow the existence of these edge cases that would undermine the central principle but i fear this is getting a bit away from your question legal equality is a very it's a powerful concept that is a distillation of a lot of social urges and preferences across that humans have we have innate equality feelings but they are very not necessarily incoherent but 
they're very situational they point in lots of different directions you pick a story you can depending on how you present it you can make pretty much any type of behavior the one to be desired or avoided on grounds of equality or other similar sentiments however we have come up with this concept of legal equality and a lot of inequality that is allowed within legal inequality so this gave us a distillation of a lot of our equality fairness values and left other ones aside so this is a more or less successful distillation of something that has a lot of weird shards in human values if you zoom into it basically what i'm saying is if you tossed away the concept of laws and legal equality and then looked into what would count as equality or fairness so toss away the concept of legal equality take the human urges towards equality and fairness and rebuild something from that we may not find something that is all that close to what we uh what we had initially okay it may be that there is other concepts of equality that may have emerged and it may have been more that you can organize society around but we have got to legal equality it is a distillation of a lot of human values and a relatively low information one compared with the amount of information that human preferences to do with fairness have so it is possible to extrapolate very successfully from lots and lots of very incoherent things towards something that didn't didn't exist before it wasn't that there was a legal system and the concept of legal equality and then we got pieces of that that were implanted into uh tribal humans and then we've just rebuilt the thing there it is a construction a generalization an extrapolation and one that is reasonably close to uh its origins okay and i guess this is an example of something where like there are potentially multiple different successful extrapolations and if you want to extrapolate you've just you sort of got to pick one and go with it um not necessarily one can always choose to become conservative across the different extrapolations this is not unusual this a useful practical feature of humans is that we tend to have diminishing returns this allows us to combine different preferences in ways that might otherwise be impossible so what is the value of the mona lisa in terms of human lives is an un undefined question i would say that for people who have not sat down to figure it out it is undefined and it's there are multiple possible answers for them but most of those answers do not lead to uh most of those human answers do not lead to a world without art or to a world without living humans in it okay cool getting back a bit to so so if i think about both that and the uh the happy face data set and challenge it seems like for both of them we're using this idea of concept extrapolation or model splintering how much yeah if i'm thinking like how valid like how much i want to invest in this in the concept of concert concept extrapolation so to speak it seems like one thing i'm going to want to know is like how universal a phenomenon at least that the relevant fixes might be so if we're working on this like happy face data set do you expect that like working on concept extrapolation in that setting is going to tell you much about you know extrapolating the concept of like legal equality for example and also do you think like for each application is are there going to be like new things we need to think about in terms of how concepts extrapolate or like will we eventually just solve concept 
extrapolation for for all the cases of splintering we expect to run into we might this is a practical question uh rather than a theoretical one if i already knew the answer to it um we would have a decent theory for how to go about it there are already some insights that are gained from the even the very simplistic image classifier the the example i gave you is of what role the world model or the unlabeled data is doing in disambiguating what the possible extrapolations are now this is maybe not you can connect it with philosophical work and computer science work that has already been done but this was to me at least a oh yes of course but i never thought of it that way um so it is already been a generator of insights and the fact that we can do this using relatively dull gradient descent methods we're getting around to writing up the method and publishing it but the fact that we can do it using relatively uh standard gradient descent methods means that it is possible to do this in the sort of standard computer science approach which it wasn't clear that that would be the case at all okay cool i mean i mean i guess it seems like you must think that there's some degree of generalization for how to think about concept extrapolation otherwise you wouldn't like form a word for it and work on this image thing when i gather your heart's desires not actually to classify happy and sad uh faces right really oh yes um [Music] we started with that because it was easiest basically we might have started on a coin run example that was another one that we considered uh though we think that the methods here can be used for the coin run example the coin run to explain was a very simple mario ish platform game in which an agent would bounce around and then find the coin all the way on the right and that would be the victory condition and if you trained agents on this it turned out that generically they didn't learn about the coin they learned about go all the way to the right okay and so if you put the coin somewhere else they would ignore it and they would go to the right so they've failed to extrapolate the reasonable goals uh what the reason goals reasonably could be from the initial data but we started with image classification just because it turned out to be easier or it seemed to be an easier place to start okay cool with some potential commercial applications which is relevant to the for profit and getting these methods out into the world angle of the company okay yeah speaking of things this might be related to how much do you see concept extrapolation as related to this idea of like corrigibility where agents are supposed to like uh allow themselves to be amended if they like you know misgeneralize in new environments or if they you know do something turns out the creator didn't actually want oh i just understood uh one of nate's points uh because of your question oh excellent okay and or how or one of the potential confusions there um i think it is both not related at all and very related okay the relation is that i think that courage ability itself is the extrapolation of the concept of the well-intentioned assistant that is the basic concept that we have examples of for humans and that we are trying to push to um super human thought to the superhuman ai level so if you have a good model of concept extrapolation then this would allow you to push courageability itself upwards but courageability i don't tend to think of the agents as courageable that as in they they extend their world model and 
then they change their preferences to according to say human feedback it seems a lot more that or it is a lot more that they have a process for how their values should be extended with as their environment changes but this is compatible with bayesianism as in this is compatible with once you've done once you've gone to the end of the process say once you know everything there is to know about the universe or every possible universe you can then work backwards from the your value system to what information about the world tells you about that value system so like when is human feedback reliable that is very easy to do if you have a the real in quotes human value function then human feedback is reliable exactly when it points towards the real human value function is unreliable when it points away from it okay but you don't want to have that way of thinking from the beginning but you want to be compatible with that as in you want to behave in a way that is compatible with a bayesian agent gaining information because if you're not you well it breaks there's a lot of theorems on this this is where things go badly and i feel that courage ability in many ways how it's conceived is a way to try and avoid the bayesian full expected utility maximizer approach you're changing the utility function you're routing tearing out the previous one and replacing with a new one that is more adapted to the situation and the most approaches including uh mine that i done with various indifference methods are trying to [Music] they're in intrinsically non-bayesian whereas concept extrapolation in general is trying to extend it in a way that could be bayesian in retrospect okay so this is a very convoluted way of saying that i see courage ability itself as the extrapolation of a specific concept but that i see the approach of courage corridability as quite different to what concept extrapolation itself is okay courageability seems to be change utilities and avoid the badness that comes from doing this that typically comes from doing this okay though i'm pretty sure that the people who are looking at courageability at the advanced level are thinking in a lot more of a bayesian way but i think most people's conception of what courageability is is different from what value extrapolation is all right but like it seems like the thing you're saying is like uh cordurability is like an extrapolated version of the concept of like deference but also like a lot of courageability approaches involve like you know you're ripping out some preferences and installing a new one whereas you think of concept extrapolation as more like oh you're updating something in a way that like could be rationalized you know in some formalism as bayesian updating but like you don't necessarily already have the prior or whatever installed that's that's what i got from your answer okay that is about a tenth of the length of my answer and much better okay but yeah yeah basically that yeah the next thing i want to ask is is there anything that you wish uh i or other people would ask you about concept extrapolation and your work on it that i have not yet asked let's think the main thing that i think i would like people to get is that a lot of things that people seem to think about concept extrapolation are wrong so the best questions would be do you believe blah or does concept extrapolation do blah which i could say no not at all um i can try a few of those questions okay um i i have a few yeah why don't you yeah so let's okay yeah here's one that i 
have so the world has a lot of like clocks in it right so for instance like the bitcoin blockchain just keeps on getting longer and like every day it's longer or on the vast vast majority of days it's longer than it's ever been and if you know that like there might be some worry that like concepts are just always going to be splintering because like for any um concept you ever see like you could imagine extrapolating it the quote-unquote normal way when the bitcoin blockchain is one block longer or you know flipping it when the blockchain is one block longer uh does this mean that like model splintering is just always constantly happening yes model splintering is constantly happening and the space of potential concept extrapolations is always huge and growing the space of useful and acceptable ones is not and what i yes so i'm not trying to get every possible extrapolation of human values that could ever exist we are trying to get the ones that not the ones we get some in the family of adequate and survivable ones from the human perspective okay i i'm wondering if the bitcoin blockchain example poses some sort of like or some sort of impossibility result right because like there's a whole bunch of way different ways concepts could extrapolate when you add this new feature and like adding like new features to the world or like stuff you haven't previously heard of can make things like either it could be totally irrelevant as in the bitcoin blockchain case or it can be extremely relevant like uh you know we like turned the consciousness off on everyone or something and so i'm wondering if that poses some sort of impossibility result where like well you know when you add a new feature things can either like basically not have important differences or have extremely important differences part of the difficulty is that human judgment is very clear as to what is irrelevant and what isn't irrelevant here so the examples you've given me obviously one is a lot more relevant than the other but we're using our human judgment for that yeah one thing that you can do is take some diminishing return mix of the possible extrapolations um generally this will turn out to be quite acceptable according to most human evaluations no one really knows what the exact balance between liberty and absence of pain is across the cosmos but we have the the range oh i have to be careful here because i'm talking about the range of the possible trade-off between the two which there is a vast amount of different trade-offs you can do which result in very good worlds but people tend to think of pushing on this okay ignore that physics is maybe a good example not necessarily the laws of physics but all the atoms in this room kind of physics you could say i add more atoms the entropy goes up the combinatorics explodes but we can deal with that so we can simplify or we can take statistical combinations of different approaches and the being conservative and combining the values or preferences or possible preferences in a conservative way take a log of them and sum them for example these these tend to work at least roughly and so the no i don't see the combinatorial explosion of possible concepts as providing all that much of a problem we can cope with this in the physical world and there's no reason that we can't cope with it in the moral world okay so again in the case of like conservatively combining different extrapolations of values um it seems like an issue is to take the bitcoin blockchain example and let's say there's you 
know some value suppose there's something very simple like the balance in my bank account i want it to be high it seems like you could have diametrically opposed different extrapolations where one extrapolation is i always want it to be high and another is well i want it to be high until the blockchain is a certain length and then i actually want it to be low and then once the blockchain is at length like you can't really conservatively combine those two um those two different extrapolations because there's there's like nothing they agree on that's good right one's just like a sign flipped version of the other in this new domain and it seems like unless you restrict what sorts of extrapolations you're allowed to throw into the conservative combination you're going to potentially run into these issues where you can't like neatly trade things off in the way you'd want to i want to specify first that conservative combination is more a placeholder than a full method i'm not saying that this would always work i'm using it as an example that can be built on the other the other argument is if your true preferences are to have the bitcoin to this point and then it be as low this should leave some traces in your in your preferences uh in your behavior in your what's the the the conservative combination and other more basic approaches doesn't really work with values that are strictly antagonistic yeah which is why say s risk utility functions are particularly bad what are s-risk utility functions uh suffering risk utility functions if there's a utility function that values extreme suffering in a positive way it is very difficult to combine that with one that wants human flourishing and human happiness whereas if you had a paper clipper utility function that is easy to combine with one that wants flourishing you get a you get more human flourishing and you get more paper clips everybody wins but when they're antagonistic like that it is very difficult but if they are antagonistic in terms of values this is where is the evidence for the um let's take the you want the bitcoin to be your bitcoin to go up until a certain date and then afterwards to go down where is the evidence for that if nothing changes apart from the date if you have not gone around saying oh i want to cash out at this date or actually bitcoin is evil i want to destroy all my value or if there is my bank account uh or or your bank account it doesn't but yeah if there is literally nothing changed or if there is literally nothing that seems relevant changed then i say basically ignore the gru and bleen type examples i.e the ones that go in a concept extends and then for some inexplicable reason it flips okay in the same way that we do that with uh in with empirical problems if however there are hints pointing that this might that this might flip then we have evidence for them that's a different situation but in most cases of extrapolation it is a question of trading off various possible values or extrapolations rather than a question of dealing with maybe it's the exact opposite of what the evidence has suggested so far okay so so this would say something like look if you can like value the same features in the same way even when like there's a new feature by default unless there's some indication that like you shouldn't do that okay it's like there was um there's an example i believe from eliezer a few years back about diamonds if you want more diamonds how do you define this across all the the future the how you extrapolate the 
concept of diamonds is there are many different ways of doing it but the basic idea is that if you go back to the environment on which people collected diamonds initially the values there or the preferences there should be should be clear and if there's no sudden flipping at that point there's no reason to have a sudden flipping of the extrapolation further on okay i guess so the next question i have about the like blockchain length versus consciousness on or off examples is it seems like um yeah the difference between those things is that one of those features is incredibly important to humans and the other another is not and therefore like you know you should like behave differently with respect to the change in blockchain length compared to how you would deal with consciousness turning off yes sorry um oh i feel free to share the insight i forgot that we don't actually we have different um audios yes okay if we are thinking of concept extrapolation a lot of what you're saying makes sense we are thinking mostly of value extrapolation whether it's not the same thing it's value extrapolation is a subset and value extrapolation is when you are aiming to extrapolate the concepts on which your values or your reward functions or your preferences are based okay so it means that we don't care about the extrapolation of the concept of blockchain because no one cares or very few people care intrinsically about blockchains so we don't have to extrapolate every single concept we only are extrapolating things of which human values are closely related to okay but but i mean you have to yourself to extrapolate like um you know how to deal with my bank account balance when the when the blockchain gets longer right um that seems like it's mostly an empirical question i just need to figure out what you want from your uh bank account balance or more precisely what you want from your ability to access that what is the values that drive you to have a certain back bank account and if i can figure those out uh then the connection between the bank account and the blockchain and all that is just empirical i i guess it's you might think that it's not totally empirical uh so partially because you don't know how those values that relate to how i deal with the bank account you know change with a longer blockchain and also i can't like simulate an environment where the bitcoin blockchain is longer because it requires solving difficult computational problems uh yeah so i can't like give examples of what the world would be like if the blockchain were longer to you at least not fully flesh out ones not fully fleshed out ones but i mean you have used this description and i understand it perfectly well or i understand it it's i don't see what the blockchain is doing here we can replace the blockchain with say any chaotic process or any complicated process pretty much it's what are what what the algorithm is fundamentally trying to extrapolate is your values and your preferences and there does not seem to be much evidence that human preferences are tied to complicated objects like the blockchain and that maybe if they are like maybe it's a noise in the neurons is relevant to whether we swerve left or right on a particular moral point then treating it as noise or as stochastic seems to be sufficient yeah i mean i'm not saying that human values actually depend on the length of the blockchain i'm just using that as an example of something that's like you know you're regularly reaching values that you've never seen before and 
like if you couldn't handle extrapolating across that things would be quite bad and there's one correct extrapolation there's one correct extrapolation yep the correct extrapolation is i don't actually care like if i wanted my bank account balance to go up before the blockchain was so long i still want it to go up yes but i think there's a there are too many things that are entangled in that um analogy what's uh for the record my actual bank account balance is with a normal us dollar bank i [Music] um and could you give us the account number and maybe a sample of your signature and your mother's maiden name this podcast maybe in the the bit after the recording perhaps but the the bitcoin basically goes through a hash function and the content of the hash also depends on people's a lot of decisions across the world i'm not if you say that your behavior depends on the length of the blockchain or on a particular feature of the blockchain you have defined a relatively simple function defining your preferences i'm not saying that i i know i'm this is not a this is not something where i have a particular answer that i'm working towards this is where i'm sort of thinking aloud on yeah uh what what you were saying so for any preference or or for any like type of behavior if you let it branch off the length of the blockchain like you can make it more complicated by you know doing some complicated thing before it was so long and then doing some different complicated thing after it was so long right so um yeah it can but there are first of all you have a complexity cost but i'm not counting on complexity costs to be the things that save us but there is but what i am counting more on is the fact that any most reasonable accounts of human values uh both empirically and theoretically do not have those kind of features if you yes if you told me that this is what future stewart would do and there was no particular reason for this i would say okay this is a failure um so it's but i could imagine a world in which our preferences were more often structured like that it just seems that it isn't the world that we live in so part of the analysis of human references and meta preferences and getting that information into the extrapolation process would rule out those kind of examples yeah i'm asking how would they rule them out by the okay this goes back to my what is it uh version 0.9 uh defining human preferences or um what what do you remember what the name of that thing was no i remember the version 0.9 but research agenda oh yes preference research agenda synthesizing a human's preferences now we have the fact that humans preferences do not in anyone's judgment behave like that how does this fact get inserted into the preferences into the extrapolation process well typically there's two ways you could imagine it going the first one is the explicit way that all human meta preferences are included in this extrapolation process and therefore that rules it out directly but it's not so much that it rules it out is that the human meta preferences point towards how preferences are supposed to behave and this thing is not compatible with the directions that they're pointing in but the other way is the human programmers decisions on what on part of the thing that they define inside their function like a lot of you can rule out a lot of click bait by saying okay if it's a listicle it's likely to be a click bait that's a feature but you're actually making a value judgment on that you're making the value judgment that 
the sort of things that are listicles are click bait and the reason there's a value is because click bait is intrinsically a value definition it is things that attract people's attention that's a behavior but that is not good for them that the uh that the their attention is attracted that's a preference but people do things like if it's a list it's more likely to be clickbait without realizing the values that they are injecting into the process so i expect that some of the choices made when saying this is how you extrapolate may encode a lot of human knowledge about our preferences without necessarily realizing it okay but but if we're talking about i don't know either concept or value extrapolation i i guess in my head i'm imagining like we're building some ai system it like has access you know it gets to look at like the whole world or at least it has the ability to infer a whole bunch of stuff about the world so i wouldn't have thought that just have this ensure that this thing doesn't have access to this random feature that like the bitcoin blockchain length that you know you don't care about preference wise it doesn't seem like i would have thought that wouldn't be an option um i think you may have misunderstood what i was saying yeah i understood you to be saying something like you know if we don't want this thing like like if there's some distractor feature like blockchain length that you don't want to you don't want to cause splintering then just like build your system so that that's not one of the features um yes but this is more this is at the meta level that the like a property if i was to say that a property of extrapolations of human values is that they don't suddenly flip to their opposites for no reason whatsoever yeah what i was just saying is that ideally this would be a meta preference that was okay let me this is something that i've thought about but not formulated extremely clearly um take i am hoping that there is a syntactic definition of wireheading that you can define wireheading um in a way that does not require looking into values much that the concept of capturing the measurement of your own reward so affecting a simple part of the universe to whereas the actual reward function affects a much larger part of the universe that this can be defined in a syntactic way i.e there might be a formula for it i give it say 40 odds that there's a formula that defines wire heading sufficiently clearly that we can set it aside it would be lovely if that was the case if i do stumble upon this formula i'm going to add it to the definition of value extrapolation now because wireheading is something we don't want i have used my own value judgment to add a syntactic piece of the definition of the process that makes it more value aligned with human preferences in uh in general this does not this is a way that our values can be implicitly encoded i mean it would be explicitly because i'd say that this was that but if this is the case this implicit encoding of the values is not inferior to another approach which is say more value extrapolation or concept extrapolation e on the example of wireheading it would be equally valid or more valid if we if it worked so if i get something add something along the lines by hand of don't assume that the values randomly flick reverse for no reason then whether this works or not is something we can test and we can compare it with other uh our other values or outcomes in various situations but but how do you go on but but how do you operationalize 
But how do you operationalize that? Like, if I'm writing my AI that gets to walk around in the world, how do you write code that says you're allowed to think that values depend on some things, but not on the length of the Bitcoin blockchain?

I mean, it's allowed to think anything.

Yeah, it's allowed to think stuff, but if I'm programming it, how do I...?

I mean, the most obvious way of extrapolating it would be some energy-style approach, where you have a variety of different criteria, like simplicity, like not randomly reversing itself, like compatibility with a whole load of estimated, extrapolated meta-preferences, and you would do an energy minimization on that.

Okay, but when you say "not randomly reversing itself", what do you mean by that, in terms of operationalizable concepts?

Operationalizable concepts... Okay, so if you just take all of the extrapolations, say complexity-weighted, and become conservative across them, you would tend to get rid of most of these cases, because for the one that starts caring the opposite way about the blockchain, there's the one that starts caring twice as strongly, for example, and there is no particular reason to prefer one over the other. So a lot of these are just going to get canceled out by that aspect of it.

Because you're averaging them, or...?

Yes, because you're averaging them, or even with a log normalization. I mean, we sort of do that already: this movement here may have caused a disastrous tsunami in three years' time (for listeners' reference, Stuart just waved his arm while holding a mug), or it may have prevented it. We don't consider all these options, mainly because there's no reason to think that it goes one way or the other. When doing extrapolations, similarly, if there is this "odd" behavior (I put "odd" in quotes, but it actually is odd), a lot of the odd behaviors, the pathological behaviors, can be defined as: yes, it might go this way, but it might go completely the opposite way as well, whereas sometimes it continues along in a more comparable way. So if we went back to the model that I was discussing before, with the different features and the probability distributions, and you looked specifically at a reward function, or candidate reward functions, you could port those back and forth between, say, your simplistic model and your more sophisticated model, and there you can start having them pay a complexity cost, in a way that is pretty close to the way that you can do it for physics.

Okay, so just in terms of this averaging-out thing, why would that not mean... say we're wondering how much I value human happiness and flourishing in the world tomorrow versus today. One way to extrapolate it would be "I value it the same way tomorrow as today"; one way to extrapolate it would be "I value it the opposite way tomorrow versus today", there's a minus-sign flip. And therefore, if you average those out, I stop valuing things at all; I'm totally neutral about things, right? Because the average of plus x and minus x is zero, basically. If there's some averaging out, why does it not average out to zero?

Because, especially if you have more than one day: if you have always valued human life roughly the same, then I would argue that the departures of "suddenly you don't value human life" and "suddenly you value human life twice as much" are equally valid extensions at that point, and those are the ones that tend to cancel out.
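Editor's sketch of how the energy-style selection and the averaging move could look on paper. The candidates, criteria, and weights are invented placeholders; Armstrong only names the ingredients (simplicity, not reversing itself, fit with meta-preferences), not a concrete scoring rule.

```python
# Toy rendering (editor's, not Aligned AI's method) of an "energy-style" choice
# among candidate extrapolations of a value: score each candidate on a few
# criteria and keep the lowest-energy one. Weights and candidates are invented.

# Each candidate: (name, weight placed on human life after the shift,
#                  number of extra clauses its description needs)
CANDIDATES = [
    ("continue valuing life as before", 1.0, 0),
    ("suddenly stop valuing life",      0.0, 1),
    ("value life twice as strongly",    2.0, 1),
    ("flip to valuing the opposite",   -1.0, 1),
]

HISTORICAL_WEIGHT = 1.0                   # how strongly life was valued before
W_COMPLEXITY, W_REVERSAL, W_META = 1.0, 1.0, 2.0

def energy(weight: float, extra_clauses: int) -> float:
    complexity = extra_clauses                       # crude description-length cost
    reversal = abs(weight - HISTORICAL_WEIGHT)       # departure from past behavior
    meta_violation = 1.0 if weight < 0 else 0.0      # people reject sign flips when asked
    return W_COMPLEXITY * complexity + W_REVERSAL * reversal + W_META * meta_violation

if __name__ == "__main__":
    for e, name in sorted((energy(w, c), name) for name, w, c in CANDIDATES):
        print(f"{e:4.1f}  {name}")
    # Averaging is the other move: departures that are symmetric around the
    # historical weight (0.0 and 2.0 here) cancel out, leaving roughly 1.0,
    # rather than everything collapsing to zero.
```

The "reversal" and "meta-preference" terms are exactly where human value judgments get smuggled into the definition of the process, which is the point Armstrong makes with the clickbait and wireheading examples.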
But how does that... I agree those are equally valid, and unlikely. I'm wondering, how does that get modeled? How does that show up, without someone just saying, "Well, those are two equally valid extrapolations"?

I mean, this again seems to be exactly analogous to the problem of, sorry, the problem of induction in empirical situations.

Yeah. So for the problem of induction you can have a prior... so I guess maybe what we're doing is we're having some kind of prior over potential concepts?

So there's two places that I would appeal to to get that prior. The first is complexity, which helps here, though as I say, I don't want to rely too much on complexity. Then there's a stronger version of complexity, which is that of all the different pieces of human values, none of them really exhibit this kind of behavior, so it would be even more surprising if just this aspect did exhibit that behavior.

Which behavior? The suddenly inverting itself for no reason?

Well, because the blockchain got longer; not for no reason.

And then the other thing is the human meta-preferences: if we had a discussion about what we value, and you asked people whether they would prefer to suddenly, completely value the opposite when the blockchain gets longer, for example, people would tend to say no. Now, I know that you can't go from empirical observations to values (I wrote a paper on that), but given some assumptions about how human statements connect with human values, from which you build the rest, this then becomes evidence that we don't want our preferences to exhibit this kind of behavior. So this is a value-related reason for the extrapolation of these lower-level values to not exhibit that behavior.

All right. So I guess we could talk about that for much longer, but our time is limited. Earlier you said there were, basically, misunderstandings that people had that you wanted to clear up. Are there any more of those that you'd like to talk about?

Yes. Part of it is that a lot of people seem to think that we're relying on human feedback to solve the ambiguity between different possible extrapolations. It's a method we use now, with much dumber AIs than humans, but this is ultimately not going to work. There are a variety of things that can be done. One is to generalize human feedback in an idealized way, which is itself a form of extrapolation. You can become conservative. There are various ways of choosing how you extend the concept of the reward, and we're going to be analyzing them. But "ask the human" is more of a practical stopgap for current systems; it's not a big aim of the approach. And in fact, in a sense, we feel that one of the bigger challenges here is not choosing the right extrapolation, but having the right extrapolation as a candidate. So just to go back to the earlier example: if we have an AI that is hesitating between "do we want videos of happy humans?" and "do we want actual happy humans?", if we've got to that, well then, fantastic. Either we can let it become conservative between the two, or we can tell it that it's the second one; most of the battle is already won at that point, which is why so much of our early focus is on getting the extrapolations, getting the good extrapolations, rather than choosing amongst them.
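Editor's sketch of the "become conservative between the two" option, once the right extrapolations are at least on the candidate list. The actions and payoff numbers are made up; the point is only that a worst-case choice over candidate rewards avoids plans that look good under just one reading of the goal.

```python
# Toy illustration (editor's, not from the episode) of acting conservatively
# across two candidate extrapolations of "make humans happy": one candidate is
# "videos of happy humans", the other is "actually happy humans". A maximin
# choice over the candidates avoids the plan that only one candidate endorses.

# PAYOFFS[action] = (reward under the "videos" candidate,
#                    reward under the "actual happiness" candidate)
PAYOFFS = {
    "flood the internet with happy-face videos": (10.0, -5.0),
    "actually improve people's lives":           ( 6.0,  8.0),
    "do nothing":                                ( 0.0,  0.0),
}

def conservative_choice(payoffs: dict) -> str:
    """Pick the action whose worst case across candidate rewards is best."""
    return max(payoffs, key=lambda action: min(payoffs[action]))

if __name__ == "__main__":
    print(conservative_choice(PAYOFFS))  # "actually improve people's lives"
```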
Yeah, although even in that case, it seems like you have to filter out the crazy extrapolations.

Oh yeah, this is a simplified outcome with just those two, but it does, sort of. So a lot of people are focusing on how you choose from amongst these extrapolations. In practice, in a lot of machine learning work, they don't generate any candidates except for one; ensemble methods generate a few more; Bayesian methods generate a huge number, but they sort of already have a prior ahead of time, which is a bit of a cheat in this case. But yes: let's start by getting some reasonable candidates before we worry too much about the process of selecting amongst them. And "ask a human" is just the current sort of stopgap for the current practical problems.

Okay, so that's one misunderstanding people have: they think that you're focusing on choosing between extrapolations.

Yes. The other one is that we think we can build up from solving image classification to solving AI alignment. In a sense, it is the opposite: we see the advanced form of value extrapolation, concept extrapolation, as the way that we plan to solve AI alignment, and we're using image classification as a toy version of this, to understand how this approach works, what the features are, what the theory is that we should be building here. So it's not so much a question of building up as of applying the ideas to the simplest non-trivial problem that we can apply them to.

Okay. I guess the worry would be, if I inhabit that critique or that misunderstanding, I would think: well, okay, you're trying it on this really simple example, and then you're learning some things and making some tweaks. Well, maybe if you're trying to align some crazy AGI, that will also have some new tweaks, or new things to learn, that weren't already contained in the image classification case.

Oh, I see the point. Yes, it's definitely not that: the learning approach that we have on image classification is not the approach for how you would align a dangerous superintelligence. We can tweak and we can experiment and we can play around with it precisely because it's not an AGI that we're dealing with. This is, in a sense, practical theory building, if that makes sense.

Sure.

If we have a system to deploy on an AGI of great power, then it would not be because "oh well, it works with images, so it'll probably work with this AGI"; it would be because our work with images and other things has led to the development of a theoretical framework that we think is sufficiently solid that it can work with an AGI.

Okay, so the idea is: you have some sort of proto-theory, you play with image classifiers, you come up with a theory, you think about "okay, is this very solid? Do I think this theory covers what it has to cover?", and then use it on AGI.

Yeah, and toy examples have been very useful in both philosophy and in physics in the past. Relativity came from ideas of "what if you were on a cannonball that was falling through the Earth?", and that kind of thing. You can get a lot from toy examples when it comes to building theory. In fact, a lot of the examples and counterexamples in alignment at the moment are essentially toy examples.
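Editor's sketch of why ambiguous training data yields multiple candidate extrapolations, in the spirit of the image-classification toy problems mentioned above. The dataset and the threshold classifiers are invented, and this is not Aligned AI's actual benchmark; it only shows two features that agree in training and come apart off-distribution.

```python
# Toy version (editor's) of "getting the extrapolations as candidates": in the
# training data two features are perfectly correlated with the label, so two
# equally good classifiers exist. Off-distribution they disagree, and each one
# is a candidate extrapolation of the learned concept.

# Each example: (feature_a, feature_b, label). In training, a and b agree.
TRAIN = [(0.9, 0.8, 1), (0.7, 0.9, 1), (0.1, 0.2, 0), (0.2, 0.1, 0)]

def fit_threshold(examples, feature_index):
    """Midpoint threshold on one feature that separates the training labels."""
    pos = [ex[feature_index] for ex in examples if ex[2] == 1]
    neg = [ex[feature_index] for ex in examples if ex[2] == 0]
    return (min(pos) + max(neg)) / 2

def make_classifier(threshold, feature_index):
    return lambda x: int(x[feature_index] > threshold)

candidates = {
    "extrapolate along feature A": make_classifier(fit_threshold(TRAIN, 0), 0),
    "extrapolate along feature B": make_classifier(fit_threshold(TRAIN, 1), 1),
}

# An off-distribution point where the features come apart (think: a video of a
# happy human versus an actually happy human):
x_new = (0.9, 0.1)
for name, clf in candidates.items():
    print(name, "->", clf(x_new))
# Both candidates fit the training data perfectly; they only disagree out of
# distribution, which is where conservatism or a choice between them is needed.
```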
Okay, so I guess my final question is: if people are interested in following your work, and more stuff about concept extrapolation or whatever else you might work on, how should they do so?

The two easiest ways are to go to our website, buildaligned.ai, and sign up for a periodic newsletter, or look at LessWrong or the Alignment Forum to see some of the things that we post there. Currently there's also a benchmark and a challenge, which is: can you disambiguate features better than what we've done? And if you can, I would love you to please get in contact about that.

Okay, and how can people find you on LessWrong or the Alignment Forum?

Stuart Armstrong is the username.

Okay, great. Well, thanks for coming on the show.

Thank you for inviting me here. It's been interesting, and I think I have developed some of the ideas in the course of this conversation.

Excellent. To the listeners: I hope it was interesting for you also. This episode was edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. The financial costs of making this episode are covered by a grant from the Long-Term Future Fund. To read a transcript of this episode, or to learn how to support the podcast, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

[Music]

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): Med 0 · avg -5 · 133 segs

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): Med 0 · avg -4 · 72 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): Med -6 · avg -7 · 120 segs

Counterbalance on this topic

Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0
Near you on the spectrum; often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript): Med 0 · avg -0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0
Near you on the spectrum; often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript): Med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0
Near you on the spectrum; often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript): Med 0 · avg -4 · 72 segs