
AXRP · Civilisational risk and strategy

Natural Abstractions with John Wentworth

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation with John Wentworth examines core safety through the lens of natural abstractions, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.
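To make the marker definitions concrete, here is a minimal sketch, assuming each transcript slice carries a single spectrum score with lower values more risk-forward and higher values more opportunity-forward; the site's actual scale and implementation are not shown here, and the function name and example scores are illustrative only.

```python
from statistics import median

def perspective_markers(slice_scores: list[float]) -> dict[str, float]:
    """Derive the three Perspective Map markers from per-slice scores.

    Assumes lower scores are more risk-forward and higher scores are more
    opportunity-forward (an assumption; the actual scale is not documented here).
    """
    return {
        "most_risk_forward": min(slice_scores),          # amber marker
        "most_opportunity_forward": max(slice_scores),   # white marker
        "median_perspective": median(slice_scores),      # black marker
    }

# Example with hypothetical slice scores.
print(perspective_markers([-40, -5, 0, 0, 10, 25]))
```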

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
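As a rough illustration of the tinting rule, the sketch below maps a score onto an amber → cyan → white strip by linear interpolation; the -100 to 100 score range and the RGB endpoints are assumptions rather than the site's actual palette.

```python
def tint_for_score(score: float, lo: float = -100, hi: float = 100) -> tuple[int, int, int]:
    """Map a spectrum score to an RGB tint on an amber -> cyan -> white strip.

    The score range and colour endpoints are illustrative assumptions.
    """
    amber, cyan, white = (255, 191, 0), (0, 183, 194), (255, 255, 255)
    t = (min(max(score, lo), hi) - lo) / (hi - lo)  # clamp, then normalise to [0, 1]

    def lerp(a, b, u):
        return tuple(round(a[i] + (b[i] - a[i]) * u) for i in range(3))

    # First half of the strip runs amber -> cyan, second half cyan -> white.
    return lerp(amber, cyan, t * 2) if t < 0.5 else lerp(cyan, white, (t - 0.5) * 2)

print(tint_for_score(-80), tint_for_score(0), tint_for_score(60))
```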

Start → End

Across 90 full-transcript segments: median 0 · mean -1 · spread -180 (p10–p90 -100) · 1% risk-forward, 99% mixed, 0% opportunity-forward slices.
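For readers who want to see how the roll-up above could be reproduced, here is a hedged sketch computing the median, mean, p10–p90, and band percentages over per-slice scores; the band cut-offs are assumptions, since the thresholds separating risk-forward, mixed, and opportunity-forward slices are not stated here.

```python
from statistics import mean, median, quantiles

def summarise_slices(scores: list[float],
                     risk_cut: float = -50, opp_cut: float = 50) -> dict:
    """Roll per-slice scores up into the headline statistics.

    The band cut-offs (risk_cut, opp_cut) are illustrative assumptions.
    """
    deciles = quantiles(scores, n=10)          # nine decile cut points
    p10, p90 = deciles[0], deciles[-1]
    n = len(scores)
    return {
        "median": median(scores),
        "mean": mean(scores),
        "p10": p10,
        "p90": p90,
        "pct_risk_forward": 100 * sum(s <= risk_cut for s in scores) / n,
        "pct_mixed": 100 * sum(risk_cut < s < opp_cut for s in scores) / n,
        "pct_opportunity_forward": 100 * sum(s >= opp_cut for s in scores) / n,
    }

# Example with hypothetical slice scores.
print(summarise_slices([-60, -10, 0, 0, 5, 55]))
```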

Slice bands
90 slices · p10–p90 -100

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 90 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video E7JH77LMuV0 · stored Apr 2, 2026 · 2,972 caption segments

Captions are an imperfect primary source: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/natural-abstractions-with-john-wentworth.json when you have a listen-based summary.

[Music] hello everybody today i'll be speaking with john wentworth an independent ai alignment researcher who focuses on formalizing abstraction for links to what we're discussing you can check the description of this episode and you can read the transcripts at axrp.net well welcome to accept john thank you thank you to our live studio audience for showing up today um so i guess the first thing i'd like to ask is i see you as being interested in resolving confusions around agency or things that we don't understand about agency so what don't we understand about agency all right well let's let's start like chronologically from where i started i started out first sort of two directions in parallel uh one of them was in biology like systems biology and some extent synthetic biology looking at uh e coli and biologists always describe e coli as like collecting information from their environment and using that to keep a sort of internal model of what's going on with the world and then making decisions based on that in order to achieve good things so they're using these very agency loaded intuitions to understand what's going on with the e coli but the actual models they have are just sort of like these simple dynamical systems occasionally with some feedback loops in there they don't have a way to like take the sort of low-level dynamics of the e coli and back out these uh sort of agency primitives that they're talking about like goals and world models and stuff sorry e coli it's a single-celled organism right is the claim that the single cells are like taking information about their environment and like yes one simple example of this is chemotaxis so you've got this e coli it's swimming around uh in like a little pool of water and you drop in a sugar molecule and there will be a sort of chemical gradient of sugar that like drops off as you move away from the grain of sugar and the e coli will attempt to swim up that gradient uh which is actually an interesting problem because when you're at a length scale that small the e coli's measurements of the sugar gradient are extremely noisy yeah so it actually has to do pretty good tracking of that sugar gradient over time to keep track of whether it's swimming up the gradient or down the gradient and is it is it using some momentum algorithm or is it just accepting the high variance and hoping it'll wash out over time uh it is essentially a momentum algorithm okay yeah which is basically like uh roughly yeah continuing to move in the direction it used to move yeah basically if it tracks the sugar concentration over time and if it's trending upwards it keeps swimming and if it's trending downwards it just sort of stops and tumbles in place and then goes in a random direction and it this is this is enough for it like if you drop a sugar molecule and a tray of e coli they will all end up converging towards that sugar molecule pretty quickly sorry not molecule sugar cube sure so how could a single cell organism do that i mean it's got 30 000 genes there's a lot of stuff going on in there the specifics in this case what you have is there's this little sensor for sugar and there's like a few other things it also keeps track of uh it's this this molecule on the surface of the cell and it will attach phosphate groups to that molecule and it will detach those phosphate groups at a regular rate and attach them whenever it senses like a sugar molecule or whatever so you end up with the equilibrium number of uh sugar mol of phosphate molecules on there is sort of 
tracking what's going on with the sugar outside and then there's a adaptation mechanism so that like if the sugar concentration is just staying high the number of phosphates on each of these molecules will sort of drop back to baseline but if the sugar concentration is increasing over time the number of phosphates will stay at a high level and if it's decreasing over time the number of phosphates will stay at a low level so it can keep track of whether concentrations are increasing or decreasing and so and so somehow it's using the structure of the fact that it has a boundary and it can just like you know have stuff on the boundary that represents stuff that's happening like that point of the boundary and then it can move in a direction that it can sort of sense yeah and then obviously there's like some biochemical signal processing downstream of that but like that's the basic idea okay all right that's pretty interesting so uh but you were saying that you were learning about e coli and i guess biologists would say that it had it was like making inferences and had goals in something but their models of the e coli were very simple dynamical systems and i imagine you're going to say something going from there yeah so then like you end up with these giant dynamical systems models where like it's very much like looking at the inside of a neural net like you've got these huge systems that just aren't very interpretable and like intuitively it seems like it's doing agenty stuff but we don't have a good way to go from like the the low-level dynamical systems representation to like it's trying to do this it's modeling the world like that so that was that was the biology angle i was also at around the same time looking at markets through a similar lens like financial markets and the the the question that got me started down that route was you've got like the efficient market hypothesis saying that you know these these financial markets are extremely unexploitable you're not gonna be able to get any money out of them okay and then you've got the coherence theorems saying like when you've got a system that does trades and stuff and you can't pump any money out of it that means it's an expected utility maximizer okay so then like obvious question what's the utility function and the bayesian model for a financial market right like the coherence theorems are telling me they should have one and if i could like model this financial market as being an expected utility maximizer that would make it a lot simpler to reason about in a lot of ways right like it would be a lot simpler to answer various questions about what it's going to do or like how it's gonna work sure so i dug into that for a while uh and eventually ran across this result called non-existence of a representative agent which says that lo and behold even ideal financial markets despite your complete inability to get money out of them they are not expected utility maximizers so what what axioms of the coherence theorems do they violate so the coherence theorems implicitly assume that there's no path dependence in your preferences and it turns out that once you allow for path dependence you can have a a system which is inexplicable but it does does not have an equivalent utility function uh so for instance in in a financial market what could happen is uh so you've got apple and google stocks and you start out with a market that's holding like 100 shares of each so you've got like this market of traders and an aggregate they're holding 100 shares 
each of apple and google okay and this just means that like apple has issued 100 stocks and google has issued 100 stocks right so in principle there could be some stocks held by entities that are outside of the market that just like never trade exactly okay cool uh yeah and i guess in practice probably like founder shares or in practice there's there's just tons of institutions that hold things long-term and never trade so they're like out of equilibrium with the market most of the time okay but anyway you've got like everyone that's in the market and sort of keeping an equilibrium with each other and trading with each other regularly they're holding 100 shares each now they trade for a while and they end up with 150 shares of apple and 150 shares of google in aggregate right and then the question is like at what prices are they willing to continue trading so like how much are they willing to trade off between apple and google are you are they willing to trade one share of apple for one share of google are they willing to trade two shares of apple for one share of google like what's what trade-offs are they willing to accept and it turns out that what can happen is depending on the path by which this market went from 100 shares of apple and 100 shares of google to 150 shares of apple 100 shares of google 150 shares of google uh one path might end up with them willing to accept a two to one trade the other path might end up with them willing to accept the one-to-one trade so like which path the market followed to get to that new state determines which trade-offs they're willing to take in that new state and how does that can can you tell us like why like what's an example of one path where they traded one-to-one in one path where they traded two to one i can't give you a numerical example off the top of my head because it's messy as all hell okay but the the basic reason this happens is that it matters how the wealth is distributed within the market so like if a bunch of trades happen which leaves someone who really likes apple stock with a lot more of the internal wealth like they they end up with more aggregate shares then you're going to end up with prices more heavily favoring apple whereas if it's trading along a path that leaves someone who likes google a lot with a lot more wealth then you're going to end up more heavily favoring google okay so that makes sense all right so we've mentioned that um in biology we have these e coli cells that are somehow like pursuing goals and like making inferences or something like that but the dynamic models of them seem very simple and it's not clear how that's what's going on there yep and we've got financial markets that like you might think would have to be coherent expected utility maximizers because of math but they're not but they're still like you're still getting a lot of stuff out of them that you wanted to get out of corresponding utility maximizers and you think there's something we don't understand there yeah so like in both of these you have people bringing in these intuitions and talking about the systems as like having goals or modeling the world or whatever like people talk about markets this way people talk about organisms this way people talk about neural networks this way and they sure do intuitively seem to be doing something agent-y right okay like these these seem like really good intuitive models for what's going on but we don't know the right way to translate those like the the low-level dynamics of the system into something 
that actually like captures the semantics of agency all right is that like roughly what you're interested in basically a researcher yeah so like what i want to do is be able to look at all these different kinds of systems and be like intuitively it looks like it's modeling the world like how do i back out something that is semantically its world model okay cool and why is that important uh well i'll i'll start with the biology and then we can like move an allergy back to ai okay so like in biology for instance the the problem is an e coli has uh what 15 20 000 genes a human has 30 000 it's just this very large dynamical system which like you can run it on a computer just fine but you can't really get a sort of human understanding of what's going on in there just by looking at the low level dynamics of the system on the other hand it sure does seem like humans are able to get an intuitive understanding of what's going on in e coli and they do that by thinking of them as agents this is how most biologists are thinking about them how most biologists are driving their intuition about like how the system works right okay so to carry that over to ai you've got the same thing with neural networks they're these huge completely opaque systems of giant tensors they're too big for a human to just like look at all the numbers and the dynamics and understand intuitively what's going on but we already have these intuitions that like these systems have goals these systems are trying to do things uh or at least we're trying to make systems which have goals and try to do things and it seems like we should be able to turn that intuition into math right like the intuition is coming from somewhere it didn't come from just like magic out of the sky like if it's understandable intuitively it's going to be understandable mathematically okay and so i guess there are a few questions i have about this so one is like one thing you could try to do is you could try and really understand ideal agency or like optimal agency like what would be like the best way for an agent to be and i think the reason people like to do that is they're like well you know we're gonna try and build agents well uh we're gonna try and have agents that figure out how to be the best agents they can be so let's just figure out like you know what the best possible thing is in this domain whereas i see you was more saying like okay let's just understand the like agency stuff we have around us like e coli but i think like an intuition you might have is like well yeah optimal like super powerful ai that we care about uh understanding maybe that's gonna be more like ideal ai than it's going to be like e coli yeah i'm wondering yeah what your take is on that so first of all from my perspective the main reason you want to understand ideal agents is because it's a limiting case which you expect real agents to approach right like you would expect humans to be closer to an ideal agent than an e coli and for you expect both of these to be much closer to an ideal agent than like iraq and so it's just like in other areas of math like we we understand the general phenomenon by thinking about the limiting case which everything else is sort of approximating all right now from that perspective there's the question of how do we like intuitively or like how do we get bits of information about what that ideal case is going to look like before we've worked out the math it's like in general it's hard to find ideal mathematical concepts and principles like your brain 
sort of has to get the concepts from somewhere and get them all loaded into your head before you can actually figure out what the math is supposed to look like right we don't go figuring these things out just by doing brute force search on theorem space we go and figure these things out by having some intuition for how the thing is supposed to work and then sort of like backing out what the math looks like from that and if you want to know like what ideal agency is going to look like do you want to get bits of information about that the way you're going to do that is go look at real agents right and that also means if you go look at real agents and see that they don't match your current math for ideal agents that's a pretty strong hint that something is wrong with that math so like the markets example i talked about earlier we had these coherence theorems that we're talking about ideal agents they should end up being expected utility maximizers and then we go look at markets and we're like well these aren't expected utility maximizers and yet we cannot money pump them what's going on here right yeah yeah then what we should do based on that is like back propagate be like okay why why did this break our theorems and then go update the theorems so like it updates our understanding of what our of what the right notion of ideal agency is okay that makes sense i guess the second question i have is why think of this as a question of agency exactly so so we're in this field called artificial intelligence or i am i don't know if you consider yourself to be but uh know you might think that oh the question is we're really interested in smart things what's up with smartness or intelligence um or you might think like oh i want to understand microeconomics or something uh like how how do i get a thing that like achieves its goal in the real world by like making trades with other intelligent agents or something um so so yeah why pick out agency as the concept to really understand so i wouldn't say that agency is the concept to understand so much as like it's the name for that whole cluster okay like if you're if you're isaac newton in 1665 or 66 or whatever it was like trying to figure out basic physics right you've got concepts like force and mass and velocity and position and acceleration and all these things like there's a bunch of different concepts and you sort of have to figure them all out at once and see the relationship between them in order to know that you've actually got the right concept right like f equals m a that's when you know you've nailed down the right concepts here because you have this nice relationship between them right so there's these different concepts like intelligence agency optimization world models where we kind of need to figure these things out more or less simultaneously and see the relationships between them confirm that those relationships work the way we intuitively expect them to in order to know that we've gotten the models right all right make sense i think that makes sense and thinking about how you think about agency i get the sense that you're interested in selection theorems yeah that was a bad name i really need better marketing for that okay anyway go on well uh why are they important and maybe what's what's the right name for them okay so the concept here is when you have some some system that's selected to do something interesting so like in evolution you have like natural selection selecting organisms to reproduce themselves or in a.i we're using 
stochastic gradient descent to select a system which optimizes some objective function on some data it seems intuitively like there should be some general principles that apply to systems which pop out of like selection pressure right so for instance the coherence theorems are an example of something that's trying to express these sorts of general principles like they're saying if you are pareto optimal in some sense then you should have a utility function right and that's the sort of thing where if you have something under selection pressure you would expect that selection pressure to select for using resources efficiently for instance and therefore things are going to pop out with utility functions or at least that's the implicit claim here right and the thing i chose the poor name of selection theorems for was this like more general strategy of trying to figure out properties of systems which pop out of selection like this so more general properties for instance in biology we see very consistently that biological organisms are extremely modular all right some of your work has found something similar in neural networks it it seems like a pretty consistent pattern that modularity pops out of systems under selection pressure so then the question is like when and why does that happen right that's a general property of systems that are under selection pressure and we want to understand that property when and why does it happen okay and thinking of much larger specifically like how how satisfied are you with our understanding of that like like do you think we have any guesses or are they good uh so first of all i don't think we even have the right language yet to talk about modularity like for instance in when when people looking at modularity in neural networks they're usually just using these like graph modularity measures yeah which are sort of like they're like when i do it yeah i mean everybody else does the same thing yeah like it's it's sort of like a hacky notion that you know it's it's enough to if you if you see this this hacky measure telling you that it's very modular then you're like yeah all right that's pretty modular but it doesn't really feel like it's the right way to talk about modularity like what we want to talk about is something about like information flowing between subsystems right which we don't have a nice way to quantify that and then the second step on top of that is once you do have the right language in which to talk about it you want to also like you'd expect to find theorems saying things that like like uh if you have this sort of selection pressure then you should see this sort of modularity pop out i i guess like an obvious concern is like if we don't have the right language for modularity then presumably there's no theorem in which that language appears and so they're not talking about the true modularity it's not necessarily that there'd be no theorem uh you'd you might expect to find like some really hacky messy theorems but like it's generally not going to be real robust like you're if it's the wrong concept of modularity there's going to be systems that are modular but like don't satisfy this definition or systems which do satisfy this definition but are like not actually that modular in some important way right and it's going to be sort of across purposes to the thing that the selection pressure is actually selecting for so you're going to get relatively weak conditions in the theorem or like you or sorry you'll need overly restrictive preconditions 
for the in the theorem and you'll get not very good post conditions right okay cool so the theorem is going to assume a lot but not produce a lot exactly yeah so uh speaking of modularity um i understand like you a lot of your work is interested in this idea of abstraction i'm not quite sure if like you think of abstractions closely related to modularity or analogical or just like another thing out there they are distinct phenomena the way i think about them like abstraction is mostly about the structure of the world whereas modularity in the like the kinds of modular i'm talking about in the context of selection theorems and mostly about the internal structure of uh of these systems under selection pressure okay that said they do boil down to like very analogous concepts like it's very much about interacting more with some chunk of stuff and not very much with the stuff elsewhere right yeah and it seems like you might you might also have for some kind of connection where like like if the real world supports some kind of good abstractions you might hope that i organize my cognition in a sort of modular way where like uh you know i've got one module for abstraction or something like that so this is this is exactly why it's important that they're distinct concepts uh because you want to have a non-trivial claim like the internal organization of the system is going to be selected to reflect the external uh natural abstractions in the environment cool so before we get into what's going on with abstraction first i'd like to ask why should we care about abstraction and like how does it relate to the task of understanding agency so there's a general challenge here which is in like communicating why it's important which is you kind of have to go a few layers down the game tree before you run into it so like if you're a biologist starting out trying to understand how e coli are agents or if you're an ai person just working on interpretability of neural nets or if you're at miri trying to figure out principles of decision theory or embedded agents or whatever it's not immediately obvious that abstraction is going to be your big bottleneck right you start to see it when you're like trying to formulate what an agent even is and there's like the cartesian boundary is the key thing here like the boundary between the inside of the agent and the outside of the agent like you're saying this part of the world is the agent the rest of the world is not the agent it's the environment but that's like a very artificial thing right there's no sort of physically fundamental barrier that's the cartesian boundary in the physical world right sure so then the question is like where where does that boundary come from like it sure conceptually seems to be a good model and that's that's where abstraction comes in like an agent is an abstraction and that cartesian boundary is essentially an abstraction boundary it's like the the conceptual boundary through which this abstraction is interacting with the rest of the world all right so that's like a sort of first reason you might run into it if you're thinking specifically about agency actually so going there if an agent is an abstraction it doesn't seem like it has to follow that we need to care a lot about abstractions just because like like the word agent is a word but we don't have to worry too much about like the philosophy of language to understand agents just because of that right potentially yes so there's certainly like concepts where we don't have to understand the 
general nature of concepts or of abstraction in order to formalize the concept yeah like in physics we have force and mass and density and pressure and stuff right like those those we can operationalize very cleanly without having to get into how abstraction works on the other end of the spectrum there's things where we're probably not going to be able to get like clean formulations of the concept without understanding abstraction in general like water bottle okay or sunglasses for those not aware uh john has a water bottle and sunglasses near him and is waving them kindly and so you could imagine that agent is in that agency is in the former category where like there are nice operationalizations we could find without having to understand abstraction in general yeah practically speaking we do not seem to be finding them very quickly and i certainly expect that routing through at least some understanding of abstraction first will get us there much faster in part i expect this because i've now spent a while understanding abstraction better and i'm already seeing the returns on understanding agency better okay cool one thing that seems kind of strange there oh maybe it's not strange but naively it seems strange that like there there's some spectrum from like things where you can easily understand the abstraction like pressure to things where like abstraction fits but just like close to our border of understanding what an abstraction is and there's a zone of things which just like not actually good abstractions uh which unfortunately i can't describe because if there were things for which there are good words tend to be useful abstractions but like well there are words that are under social pressure to be bad abstractions if i want to get into politics yeah well i don't i i guess in the study of like uh sometimes in philosophy people talk about the composite object that's like this particular around a bit of jupiter and this particular around a bit of mars in this particular around a bit of your hat not a good abstraction yep so there's this whole range of like how well things fit into abstraction and like it seems like agencies in this like border zone uh what's what's up with that isn't that like a weird coincidence uh i don't think it's actually in a border zone i think it's just like squarely on the clean side of things okay we just haven't yet figured out the language and like doing it the old-fashioned way is really slow one of my go-to examples here is shannon inventing information theory right right he found these these really nice formalizations of some particular concepts we had and he spent like 10 years figuring these things out and was like this incredible genius uh we want to go faster than that like this that guy was like a once in 50 years maybe once in a hundred years genius and like figured out four or five important concepts all right we we we want to be we want to get an iterative loop going and not be like sort of trundling along at this very slow pace of theorizing all right do you think there's just like a large zone of things that are squarely within the actual concept of abstraction but like our understanding of abstraction is something like a low dimensional manifold like a line on a piece of paper where the line takes up way less area than the paper i'm not sure what you're trying to gesture at uh so do you think that that it's just something like there are tons of concepts that are solidly abstractions but we don't have solid language to understand why they're 
abstractions so it's not surprising that agency is one of those yes i should distinguish here between like there's things which are mathematical abstractions like agency or information and there's things that are physical abstractions like water bottle or sunglasses the mathematical ones are somewhat easier to get at because in some sense all you really need is the theory of abstraction to like and then the rest is just math yes the physical ones potentially you still need a bunch of data and a bunch of like priors about our particular world or whatever all right so that was one way we could have um got to the question of abstraction you said there was something to do with biology oh there's absolutely other paths to this so like this was roughly the path that i originally came in through you could also come in through something like if we want to talk about human values and have like robust formulations of human values the things that humans care about themselves tend to be abstractions like i care about uh i care about pan my pants and uh windows and trees and stuff i don't care so much about quantum field fluctuations quantum field fluctuations lots of them most of them are not very interesting so like the things humans care about are abstractions so if you're thinking about like value learning like learning human values having a good formulation of what abstractions are is going to massively exponentially narrow down the space of uh things you might be looking for like you know we care about these particular high level abstractions you no longer have to worry about like all possible value functions over quantum fields right sure you could worry about macroscopic things similarly for things like uh talking about impact the sort of impact we care about is like things affecting sort of big macroscopic stuff we don't care about things affecting uh rattlings of molecules in the air right so again there it's all about the the high level abstractions are the things we care about so we want to measure impact in terms of those and if you have a good notion of abstraction a robust notion of abstraction then you can have a robust notion of impact in terms of that okay cool and so what would it look like to understand enough about abstraction that we could satisfy multiple of these like divergent ways to start caring about abstraction yeah so one one way to think of it is you want a mathematical model of abstraction which very robustly matches the the the things we think of as abstraction so for instance it should be able to look at shannon's theory of information and like these measures of mutual information channel capacity and say ah yes these are very natural abstractions in mathematics it should also be able to look at a water bottle and be like uh ah yes in this universe the water bottle is a very natural abstraction and this should like robustly extend to all the different things we think of abstractions it should also capture the intuitive properties we expect out of abstractions so like if i'm thinking about water bottles i expect that i should be able to think of it as a fairly high level object maybe made out of a few parts but like i don't have to think about all the low-level goings-on of the material the water bottle is made of in order to like think about the water bottle right i can pick it up and move it around nice thing about the water bottle is it's a solid object so there's like six uh parameters i need to summarize the position of the water bottle even though it has like a 
mole of particles in there right yeah well i guess you also need to know whether the lid is open or closed yep captioning not dimensions to available the water bottle in this sense yeah way fewer than that i'm not talking about definitely not a mole okay cool so basically something like uh we want to understand distraction well enough to know like just take anything and be like okay what are the good abstractions yes and like know that we know that and this should like robustly match our our intuitive concept of what abstractions are so when we go use those intuitions in order to like design something it will robustly end up working the way we expect it to in so far as like we're we're using these mathematical principles to guide it okay at this point i'd like to talk about your work on abstraction which i i think of it under the heading is the natural abstraction hypothesis so first of all can you tell us what is the natural abstraction hypothesis so there's a few parts to it the first part is the idea that our physical world has some broad class of abstractions which are natural in the sense that like a lot of different different minds a lot of different kinds of agents under selection pressure in this world will converge to similar abstractions or roughly the same abstractions so that's the first part second part obvious next guess based on that is that these are basically the abstractions which humans use for the most part and then the the third part which is where like the actual math i do comes into it is that these abstractions mostly come from information which is relevant at a distance so like there is relatively little information about any little chunk of the universe which propagates to chunks of the universe far away all right and what's like the status of the natural abstraction hypothesis is it a natural abstraction theorem yet or like how much of it have you knocked off so i do have some pretty decent theorems at this point within worlds that are that are similar to ours in the sense that they have local dynamics so like in our world there's a light speed limit you can only directly interact with things like nearby in space time anything else has to be indirect interactions that might propagate right in universes like that i have this cool result called the telephone theorem which basically says that the only information which will be relevant over a long distance is information that is arbitrarily perfectly conserved as it propagates far away so that means that most information presumably is not arbitrarily perfectly conserved all of that information will be lost it gets wiped out by noise all right i guess one question is um so part of the hypothesis was that a lot of agents would be using these natural distractions what counts as a lot and what class of like agents are you imagining good question so first of all this is like not a question which the math is explicitly addressing yet this is intended to be addressed later like i'm mostly not explicitly talking about agents yet okay that said very recently i've i've started thinking about agents as things which optimize stuff at a distance so in the same way that abstraction is talking about information relevant at a distance we can talk about agents which are optimizing things far away and presumably most of the the like agency things that are interesting to us are going to be interesting because they're optimizing over a distance if something is only optimizing like a tiny little chunk of the world in a one centimeter 
cube around it we mostly don't care that much about that like we care insofar as it's affecting things that are far away so in that case this thing about like only so much information propagates over a distance well obviously that's going to end up being the information that you end up caring about influencing for optimization purposes and indeed the math bears that out it's you end up if you're trying to optimize things far away then it's the abstract sins that you care about because like that's the only channel you have to interact with things far away sure what's the relevant notion of far away that we're using here so uh we're modeling the world as a big causal graph and then we're talking about nested layers of markov blankets in that causal graph so physically you could in our particular universe our our causal graph follows the structure of space time things only interact with things that are nearby in this sort of little four-dimensional sphere of space-time around the first thing so you can imagine like little four-dimensional spheres of space-time like nested one within the other so you've got like a sequence of nested spheres and this would be a a sequence of markov blankets that as you go outwards through these spheres you get further and further away that's that's the notion of distance here now it doesn't necessarily have to be like spherical like you could take any surface you want they don't even have to be like surfaces in our like four dimensional spacetime they could be more abstract things than that the important thing is that like each of these blankets is the the things inside the blanket only interact with the things outside the blanket via the blanket itself like has to go through the blanket so it sounds like the way this is going to work is that you're um you you can get a notion of something like locality out of this but it seems like it's going to be difficult to get a notion of like is it three units of distance away or is it 10 units of distance away because like your concentric spheres can be like closely packed or loosely correct right yeah so uh in general like the first of all the theorem doesn't explicitly talk about distance per se it's just talking about limits as you get far away okay uh i do have other theorems now that like don't require those those nested sequences of markup blankets at all but like there's yeah it's it's not really about the the distance itself it's about like defining what far away means okay sure that makes sense and i guess one thing which occurs to me is i would think that sometimes systems could like have different types of information going in different directions right so you might imagine i don't know a house with uh two directional antennas and like in one it's beaming up the simpsons and the other it's beaming out like uh i don't know the heart rate of everyone in the house or something and then you'll also have a bunch of other directions which aren't beaming out anything yeah yeah so what's up with that naively it doesn't seem like it's handled well by yourself so if you're thinking of it in terms of these nested sequences of markov blankets you could like draw these blankets going along one direction or along a different direction or along another direction right and depending on which direction you draw them along you may get different abstractions corresponding to information which is conserved or lost along that direction basically there's the the versions of this which don't explicitly talk about the markov 
blankets instead you just talk about everything that's propagating in in every direction and then you'll just end up with like some patch of the world that has a bunch of information shared with this one and if there's like a a big patch of the world that means it must have propagated to a fairly large amount of the world right all right so sorry and that one is it just saying like yeah here's a region of the world that's like somehow linked with uh my house yes like if we take the example of the directional antenna beaming the simpsons from your house uh there's going to be like some chunk of chunk of air going in like a sort of line or cone or something away from your house which has a very strong signal of the simpsons playing and that's all all that that whole region of space time is going to have a bunch of mutual information that's like the simpsons episode right okay and then the stuff elsewhere is not going to contain all that information so there's this natural abstraction which is sort of like localized to this particular cone in the same way that for instance if i have a physical gear and a gearbox there's going to be a bunch of information about the rotation speed of that gear which is sort of localized to the gear yeah but like it won't be localized to somebody standing outside the box exactly okay so can you just spell it out for us again like why does the natural abstraction hypothesis matter so if you have these natural abstractions that lots of different kinds of agents are going to converge to then first that gives you it gives you a lot of hope for some sort of interpretability or some sort of like being able to communicate with or understand what other kinds of agents are doing so for instance if the natural abstraction hypothesis holds then we'd expect to be able to look inside of neural networks and find representations of these natural abstractions right if we're good enough looking if we're if yeah if we like have the math and we're we know how to use it right yeah like it should be possible another example there there's this puzzle babies are able to figure out what words mean with like one or two examples right like you show baby an apple you say apple pretty quickly it's got this apple thing down however if you think about like how many possible apple classifier functions there are on a one megabyte image yeah how many such functions are there it's going to be like 2 to the two to a million sure you in order to learn that apple classifier by brute force you would need about two to a million bits two to a million samples of does this contain an apple or not babies do it with one or two so like clearly the place we're getting our concepts from is not just like brute force classification like we have to have some sort of prior notion of what sorts of things are concept d what sorts of things we normally attach words to otherwise we wouldn't have language at all it just big-o algorithmically would not work and the question's like all right what's going on there how how is baby doing this yeah and natural abstractions would provide like an obvious natural answer for that yeah i guess like it seems like this is almost like so in statistical learning theory sometimes people try to um prove generalization bounds on learning algorithms by saying well there's only a small number of things this algorithm could have learned and the true thing is one of them so if it gets the thing right on the training data set then like there's only so many different ways you could wiggle 
on the test data set and because there's a small number of them then it's probably going to get the right one on the test data set right and there's this famous problem which is like neural networks can express a whole bunch of stuff but like yeah it seems like one way i could think of the natural abstraction hypothesis saying well like they'll just tend to learn the natural abstractions which is the smaller hypothesis class and like that's why uh statistical learning theory can work at all yeah so this is tying back to the selection theorems thing one thing i expect is a property that you'll see in selected systems is you'll get either some combination of broad peaks in the parameter space or robust circuits or robust strategies so like strategies which work even if you change the data a little bit some combination of these two i expect selects heavily for natural abstractions yeah like you can have optimal strategies which just like encrypt everything on the way and decrypt everything on the way out and it's total noise in the middle so like clearly you don't have to have a lot of structure there just to get an optimal behavior but when we go look at systems they like humans sure do seem to converge on like similar abstractions like we have this thing about the babies yeah so it seems like that extra structure has to come from some combination of broadness and robustness rather than just optimality alone yeah i said there are actually two things this reminds me of but i hope you'll indulge me and maybe comment on so the first thing is um there's this finding that neural networks are really good at classifying images but you can also train them to optimality on like just make images by selecting random values for every pixel and like giving them classes and like neural networks will in fact be able to memorize that data set yep but it takes them a bit longer than it takes them to learn like actual data sets of real images which seems like it could be related to somehow it's harder for them to find the unnatural abstractions yep uh the second thing is i have a colleague well i call you in the broad sense he's a professor at the university of oxford called jacob fester who's interested in coordination particularly zero short coordination so imagine like i trained my a i bought e e-train your air but he's interested in like okay what's like a good rough algorithm we could use where like even if we had slightly different implementations and slightly different you know batch sizes and learning rates and stuff our bots would still get to play nicely together and a lot of what he's worried about is ruling out like these strategies that um well like if i just trained a bunch of bots that learned to play with themselves then they would learn these like really weird looking arbitrary encoding mechanisms um like if i if i jiggle my arms slightly this funny bees do a little dance that tells the other bees where to fly he actually uses exactly that example nice uh or he did in the one park i attended um and he's like no we've got to figure out how to make it not do that kind of crazy stuff so yeah i'm wondering if you thought about the relationship between natural abstraction hypothesis and like coordination rather than just like a single agent yeah so i've mostly thought about that in the concept of like how how are humans able to have language that works at all yeah but it's the same principles sure like you've got this giant space and you need to somehow coordinate on something in there like there 
has to be some sort of prior notion of what like the right things are cool one question i have a thing that seems problematic for the natural hypothesis abstraction hypothesis is that sometimes people use the wrong abstractions and then they change their minds right so like i guess a classic example of this is like is a whale of fish in some biblical texts uh whales are described as fish or things that seem like they probably are whales but like now we kind of think that that's like the wrong way to define what a fish is or at least some people think that yeah uh so a few notes on this first it is totally compatible with the hypothesis that at least some human concepts some of the time would not match the natural boundaries that's like okay it doesn't have to be 100 of the time thing like it is entirely possible that people just like missed somehow that said whale being a fish would not be an example of that the telltale sign of that would be like people are using the same word and like whenever they try to talk to each other they just get really confused because nobody's nobody like quite has the same concept there right that would be a sign that humans just like are using a word that doesn't have a corresponding natural abstraction okay next thing there's the general philos philosophical idea that words point to clusters in thin space and that still carries over to natural abstractions can you say what it what a cluster in thing spaces uh sure so the idea here is you've got like a bunch of objects that are sort of like similar to each other so like we clustered them together we treat these as like instances of the cluster but you're still going to have lots of stuff out on the boundaries it's not like you can come along and draw a nice clean dividing line between like what is a fish and what is not a fish there's going to be stuff out on the boundaries and that's fine we can still have well-defined words in the same sense mathematically that we can have well-defined clusters so like you can talk about the average fish or like the sort of shape of the fish cluster like what are the major dimensions along which fish vary and like how much do they vary along those dimensions and you can have unambiguous well-defined answers to questions like what are the main dimensions along which fish vary without necessarily having to have hard dividing lines between like what is a fish and what is not a fish all right so in the case of whales do you think that what happened is like humans somehow miss the natural abstraction or do you think they like the version of fish that includes like are you saying the whale is like a boundary thing that's like so a whale like clearly it shares a lot of properties with fish you can learn a lot of things about whales by looking at fish right like it is it is clearly at least somewhat in that cluster like it's it's not a central member of that cluster but it's definitely at least on the boundary yeah and it is also in the mammal cluster like it's it's at least on that boundary so like yeah it's it's a case like that so if that were true i feel like so biologists understand a lot about whales i'm told i assume and i hope i'm not just totally wrong about this but my understanding is that biologicals are not like is that biologists are not unsure about whether whales are fish no they're not like oh they're like oh they're basically fish so like once you get down once you start being able to read genetic code the phylogeny like the branching of the the evolutionary tree 
becomes like the natural way to classify things and it just makes way more sense to classify that things for lots of purposes than any other way and like if you're trying to think about say the shape of whale bodies then it is still going to be more useful to think of it as a fish than a mammal but like mostly what biologists are interested in is more like uh metabolic stuff or like what's going on in the genome and for that much broader variety of purposes is going to be much more useful to think of it as a as a mammal okay so if i type this back into like how abstractions work are you saying something like like the abstraction of fish is this kind of thing where whale was an edge case but like once you gain like more information it starts uh fitting more naturally into the mammal sorry i mean the right answer here is like things that do not always unambiguously belong to one category and not another and also the categories like the natural versions of these categories are not actually mutually exclusive when you're thinking about phylogenetic trees it is useful to think of them as mutually exclusive but like the underlying concepts are not all right cool yeah i guess digging in a little bit when you say abstraction is like information that's preserved over a distance like should i think of the underlying thing is like there's probabilistic things and there's some probabilistic dependence and there's a lot of probabilistic independence or it's like so is it just like based on probabilistic stuff yep so the current formulation is all about like probabilistic conditional independence right like you have some information propagating uh conditional on that information everything else is independent that said there are known problems with this for instance this is really bad for handling mathematical abstractions like what is a group or what is what is a field right these yeah yeah these are clearly very natural abstractions and this framework does not handle that at all and i i expect that this framework captures sort of like the fundamental concepts in such a way that like it will clearly generalize once we have the the language for talking about like more matthews type things but like it's it's clearly not the whole answer yet when you say once we have the language for talking about more mathy type things what what do you mean there probably category theory but like nobody seems to speak that language particularly well so so if we understood like the the true nature of maths then we could understand like how it's related to the like probably if you had a sufficiently deep understanding of category theory and could take like the probabilistic formulation of abstraction that i have and express it in category theory then it would clearly generalize to all this other stuff in nice ways but i'm not sure any person currently alive understands category theory well enough for that to work how well would they like some people write textbooks about category theory right yeah is that not well enough almost all of them are trash well some of them aren't apparently i mean the the best category theory textbooks i have seen are solidly like decent but like they're still i i don't think i have found anyone who like has a really good intuition for like category theoretic concepts there are lots of people who can like crank the algebra and like they can like use the algebra to say some things in some narrow contexts but i don't really know anyone who can go like i don't know look at a giraffe and talk about 
the giraffe directly in category theoretic terms and say things which are non-trivial and actually useful mostly they can say very simple stupid things which is a sign that you haven't really deeply understood the language yet okay so there's this question of how does it apply to math i also have this question of how it applies to physics so some well i believe that physics is actually deterministic uh which you might believe if you were into many worlds quantum mechanics or at least it seems conceivable that physics could have been deterministic and yet you know maybe we would still have stuff that looks kind of like this yeah so in the physics case it's we're kind of on easy mode because chaos is a thing okay so for instance a a classic example here is a billiard system so like you've got a bunch of little hard balls bouncing around on a table yeah inelastically sorry perfectly elastically so like their energy isn't lost and the thing is every time there's a collision and then the balls roll for a while and there's another collision uh it basically doubles the the noise in the angle of a ball roughly speaking so like the the noise grows exponentially over time and you very quickly become maximally uncertain about the positions of all the balls when you say someone's uncertain about the positions of all the bulls like why so meaning if you have any any uncertainty at all in their initial conditions that very quickly gets amplified into being completely uncertain about where they ended up but why would no idea why would anything within this universe have uncertainty about their initial conditions given that it's a deterministic world and everything is like you can infer everything for every from everything else well you can't infer everything from anything uh everything else i mean why not get a billiards go look at a billiards table and like try to measure the balls and see how precise you can get their positions and velocities yeah what's going on there like why like your measurement tools in fact have limited precision and therefore you will not have arbitrarily precise knowledge of those those states so when you say my measurement tools have limited precision in a deterministic world it's not because there's like random noise that's washing out the signal or anything so yeah what do you mean my measurement tools have i mean you know take a ruler look at look at the lines on the ruler they're only so far apart and like your eyes like if you if you use your eye you can get like another tenth of a millimeter out of that ruler but it's still you can only you can only measure so so small right so it's a claim something like there's chaos in the billiard balls where like you know if you change the initial conditions of the billiard balls very slightly that makes a big change to where the billiard balls are later on but like i is a physical system like maybe there are a whole bunch of different initial states of the billiard bolts that get mapped to the same mental state of me yep and that's why uncertainty is kicking in yep you clearly like you will not be able to tell from your personal measurement tools like your eyes and your rulers whether the ball is right here or whether the ball is a nanometer to the side of right here and then once those balls start rolling around very quickly it's going to matter whether it was right here or an animator to the side okay so and so the story here is something like like the way probabilities come into the picture at all but the way uncertainty comes 
into the picture at all is because like things are deterministic but there's like many to one functions and you're trying to invert them as a human and that's why there's things being messed up so the problem with yeah i don't i don't want to put you into the deep end of philosophy of physics or too unnecessarily but like suppose i believe in like many worlds quantum mechanics and the reason you should suppose that is that i do um in many real in many worlds quantum mechanics stuff is deterministic sorry but do you really leave it ah i think so so in many worlds quantum mechanics there's no such thing as like many-to-one functions in terms of physical evolution right like the jargon is that the time evolution is unitary which means that like two different initial states always turn into two different subsequent states this is true in chaotic systems like there are lots of chaotic systems with this property okay so what's going on with this idea that there's some many to one function that i'm trying to invert that well that's why i'm uncertain the many to one function is the function from the state of the system to your observations it's not the function from state of system to future state of system okay but like i'm also part of the physical system of the world right yep yes you are so why isn't why isn't the chaos meaning that i have a really good measurement of the initial state like when you say there's chaos right that means the billiard bulls are doing an amazing job of measuring the initial state of the billiard balls right some sense yeah why aren't why aren't you doing an amazing job well uh you could certainly imagine that you just like somehow accidentally start out with a perfect measurement of the billiard ball state and you know that that would be a coherent world that that could happen but it gets better we observe that you do not do that okay yeah what's going on empirically we observe that you have uh only very finite information about these billiard balls like your the mutual information between you and the billiard balls is pretty small you could actually uh if i want to be sorry what do you mean the mutual information so in a deterministic world what do you mean the mutual information is very small so empirically you could like make predictions about what's going on with these billiard balls or like guess where they are when they're starting out and then also have someone else take some very precise like nanometer measurements and see how well your guesses about these billiard balls mapped to those nanometer precision measurements and then because you're making predictions about them you can you know crank the math on a mutual information formula assuming your predictions are coherent enough to imply probabilities if they're not even doing that then you're like in stupid land all right and your predictions are just totally but like yeah you can get some mutual information out of that and then you can quantify uh how much information does daniel have about these billiard balls okay like in the pre-bayesian days this was like how people thought about it right like yeah i guess i'm still not so okay i totally believe that that would happen right but but i guess the question is like how can that be what's happening in a deterministic world that i still don't totally get i mean what's what's what's missing what's missing is that if so we have this deterministic world a whole bunch of stuff is chaotic right yep what chaos means is that everything everything is measuring 
everything else very very well no it does not mean that everything is measuring everything else very very well what it means is that everything is measuring some very particular things very very well so for instance well okay not for instance uh conceptually what happens in chaos is that the current macroscopic state starts to depend on like less and less significant bits in the initial state yeah so like if you just had a simple one-dimensional system that starts out with this real number state and you're looking at the first 10 bits in in time step one and then you look at the first ten bits in time step two well the first ten bits in time step two are going to depend on the first eleven bits in time step one okay and then in time step three it's gonna depend on the first 12 bits in time step one right so yeah depending on more bits further and further back in that expansion over time right yep this does not mean that everything is depending on everything else and in particular if the if like the bits and time step 100 are depending on the first 100 bits you still only have 10 bits that are depending so like there's only so much information about 100 bits that 10 bits can contain right like you have to be collapsing it down an awful lot somehow right and even if it's collapsing according to like some random function random functions have a lot of really simple structure to them yeah so like one one particular way to imagine this is like maybe you're just xoring those first hundred bits to get bit one right and if you're just xoring them then that's extremely simple structure in the sense that like flipping any one of them flips the flips the output yeah random random functions end up working sort of like that if you you you only need to flip like a handful of them in order to flip the output for any given setting right all right and now i have forgotten how i got down that tangent what was the original question the original question is in a deterministic world how does any of this probabilistic inference happen given that it seems like my physical state sort of has to like there's no randomness where my physical state can only lossly encode the physical state of some system that i'm interest some other subsystem of the real universe that i'm interested in say that again so if if the real world isn't random right then like every physical system like there's only one way it could have been right for some value of could so it's not obvious why anybody has to do any probabilistic inference right i mean like given the initial conditions there's only one way the universe could be but like you don't know all the initial conditions you can't know all the initial conditions because you're embedded in the system so are you saying something like the initial conditions like like there are 50 bits in the initial conditions and i have a simple function of the initial conditions such that the initial conditions could have been different but i would have been the same uh let's forget about bits for a minute and talk about like let's say let's say atoms we'll say we're in a classical world daniel is made of atoms how many atoms are you made of uh i guess a couple moles yeah which is probably a bit more than that i would guess like tens of moles okay so let's say adam sorry daniel they have thinking tens of moles of atoms uh so to describe daniel's initials like state at some time we'll take that to be the initial state daniel's initial state consists of state of uh some tens of moles of atoms state of 
the rest of the universe consists of the states of an awful lot of moles of atoms; I don't actually even know how many orders of magnitude. But unless there's some extremely simple structure in all the rest of the universe, the states of those tens of moles comprising you are not going to be able to encode both their own states and the states of everything else in the universe. Does that make sense?

That seems right. Though isn't there this thing in thermodynamics where the universe had a low-entropy initial state? Doesn't that mean it's easy to encode?

Okay, well, we can move on from there; I think we've spent a bit of time in that rabbit hole. The summary is something like: the way abstraction works is that if I'm far away from some subsystem, there are only a few bits which are reliably preserved about that subsystem that ever reach me, and everything else is shook out by noise or whatever. Those few bits are the abstractions of the system.

Yep.

Okay. A thing which you've also talked about as a different view on abstraction is the generalized Koopman-Pitman-Darmois theorem, am I saying that right? Can you tell us a little bit about that and how it relates?

All right, so the original Koopman-Pitman-Darmois theorem was about sufficient statistics. Basically you have a bunch of i.i.d. measurements, so independent measurements where each of them follows the same distribution; we're measuring the same thing over and over again. And we're supposing that there exists a sufficient statistic, which means we can aggregate all of these measurements into some fixed-dimensional thing: no matter how many measurements we take, we can aggregate them all into one summary of some limited dimension, and that will summarize all of the information about these measurements which is relevant to our estimate of the thing. So for instance, if you have normally distributed noise in your measurements, then a sufficient statistic would be the mean and standard deviation of your measurements. You can just aggregate that over all your measurements, and that's two-dimensional, or something like (n² + 1)/2-dimensional if you have n dimensions, but it's this fixed-dimensional summary. And you can see how conceptually this is a lot like the abstraction thing: I've got this chunk of the world, there's this water bottle here, this is a chunk of the world, and I'm summarizing all of the stuff about that water bottle that's relevant to me in a few relatively small dimensions. So conceptually you might expect these two to be somewhat closely tied to each other.

Yeah.

The actual claim that Koopman-Pitman-Darmois makes is that these sufficient statistics only exist for exponential-family distributions, also called maximum-entropy distributions. This is a really nice class of distributions which are very convenient to work with; most of the distributions we work with in statistics or in statistical mechanics are exponential-family distributions. And it's this hint that in a world where natural abstraction works, we should be using exponential-family distributions for everything.

Okay.

Now, the reason it needs to be generalized is that the original Koopman-Pitman-Darmois theorem only applies when you have these repeated independent measurements of the same thing. So what I wanted to do was generalize that to, for instance, a big causal graph, like the sort of model that I'm normally using for worlds in my work. And that turned out to basically work: the theorem does indeed generalize quite a bit. There are some terms and conditions there, but it basically worked. So the upshot of this is that we should be using exponential-family distributions for abstraction, which makes all sorts of sense: if you go look at statistical mechanics, this is exactly what we do. We've got the Boltzmann distribution, for instance; it's an exponential-family distribution. And in particular, that's because exponential-family distributions are maximum-entropy, which means they're as uncertain as you can be subject to some constraints, like the average energy.

Yes.

Now, there's a key thing there. The interesting question with exponential-family distributions is: why do we see them pop up so often? The maximum-entropy part makes sense; that part is very sensible. The weird thing is that it's a very specific kind of constraint. The constraints are expected-value constraints: in particular, the expected value of my measurement is equal to something.

Or some function of my measurement as well?

Yes, but it's always the expected value of whatever the function of the measurement is. And then the question is: why this particular kind of constraint? There are lots of other kinds of constraints we could imagine, lots of other kinds of constraints we could dream up, and yet we keep seeing distributions that are max-entropic subject to this particular kind of constraint. And then Koopman-Pitman-Darmois is saying: well, these things that are max-entropic subject to this particular kind of constraint are the only things that have these nice summary statistics, the only things that abstract well. So it's not really answering the question, but it sure is putting a big old spotlight on the question, isn't it?

Okay. So it's almost saying that somehow the summary statistics are the things that are being transmitted; they're the only, like, minimal things which will tell you anything else about the distribution.

That's exactly right. It works out that the summary statistics are basically the things which will propagate over a long distance. Another way you can think about the models: you have the high-level concept of this water bottle, you have some summary statistics for the water bottle, and all the low-level details of the water bottle are approximately independent given the summary statistics.

Okay. The last question I wanted to ask here, just going off the thing you said earlier about abstractions being related to what people care about, and impact measures and stuff. There's this natural abstraction hypothesis, which says that a wide class of agents will use the same kinds of abstractions. Also, in the world of impact measures, a colleague of mine called Alex Turner has worked on attainable utility preservation, which is this idea that you just look at changes to the world that a wide variety of agents will care about: impact is whatever would change a wide variety of agents' ability to achieve goals in the world.

Yep.

I'm wondering, are these just superficially
similar or is there like these these are these are pretty directly related so if you're thinking about like information propagating far away uh anything that's optimizing stuff far away is mainly going to care about that information that's propagating far away right yeah and if you're in a reasonably large world most stuff is far away so like most utility functions are mostly going to be optimizing for stuff far away okay so if you average over a whole bunch of utility functions it's naturally going to be pulling out those those natural abstractions okay cool all right so i guess yeah going off our earlier discussion of impact measures i like to talk about value learning and alignment so and speaking of like people being confused about stuff how confused do you think we are about alignment how confusing uh what narrow down your question here what are you asking yeah what do you think we don't know about alignment that you wish we did look man i don't even know which things we don't know that's how confused we are okay if we knew which specific things we didn't know we would be in way better shape all right well okay here here's something that someone might think here's the problem of ai alignment like humans have some preferences there's like ways the world can be and we think some of them are better there are some of them can worse and some of them are worse and we want ai systems to pick up that ordering and basically like try to get worlds that are high in that ordering by doing agency stuff from solved uh what's i'm saying do you think there's anything wrong with the part where the problem was solved uh yeah i said what we wanted to do you didn't say anything at all about how we'd do it okay i mean even with the want to do part i'd say there's probably some minor things that disagree with there but it sounds like a basically okay statement of the problem okay so yeah i guess well i don't know people can just tell you what things they value and people do this right now man have you like talked to a human recently they cannot tell me what things that like i ask them what they like and they have no include excuse me they have no clue what they like i talk to people and they're like ah i'm like what's your favorite food and they're like ah kale salad and i'm like extremely skeptical that sure sounds like the sort of thing that someone would say if their idea of what food was good was strongly optimized for social signaling and they were paying no attention whatsoever to the enjoyment of the food in the moment and people do this all the time this is this is how the human mind works people have no idea what they like i mean so have you had a girlfriend did your girlfriend reliably know what she liked because my girlfriend certainly does not reliably know what she likes well um let's so okay if we wanted to understand like the the innermost parts of the human psyche or something then i i think this would be a problem but it seems like we want to create really smart ai systems to do stuff for us and like the stuff doesn't have to be like i don't know sometimes people are really worried that um even if i wanted to create the best theorem improver in the world that just like uh you know could walk around like prove some theorems and then like solve all mathematics for me um people act like that's a problem and like i don't know how hard is it for me to be like yeah please solve mathematics please don't murder that dog or whatever like is that is that all that hard i mean the the alignment 
problems for a theorem prover are relatively mild the problem is you can't do very many interesting things with the theorem improver okay what about let's say a material scientist all right here's what i want to do i like i want to create an ai system in a way better than any human i can just like describe properties of some kind of material that i want okay and the ai system like you know it figures out like how i do chemistry to create a material that gets me what i want yeah okay now we're getting into like at least mildly dangerous land the big question is how rare a material are you looking for like if you're asking it a relatively easy problem then this will probably not be too bad if you're asking it for a material that's extremely complicated and like very very it's very difficult to get a material with these properties then you're potentially into the into the region where like you end up having to engineer biological systems or like build nano systems in order to get the material that you want and when your system is spitting out nano machine designs then you got to be a little more worried okay so so when when i was talking about like the easy path to alignment where you just say what outcomes you do and don't want why does the easy path to alignment fail here oh god that that path fails for so many reasons all right so first there's there's like my initial reaction about like humans in fact have no idea what they want the things they say they want are not a good proxy for the things they want even in the context of material science in the context of material science eh so like the sort of problem you run into there is this like unstated parts of the values thing like a human will say ah i just want a material with such and such properties but what they actually want is or like they they say i want you to build me give me a way to make the material with such and such properties right like maybe the thing can spit out a specification for such material but that doesn't mean you have any idea how to build it so you're like all right tell me how to build a material with these properties right and what you in fact wanted was for the system to tell you how to build something with those properties and also not turn the entire earth into solar panels in order to collect energy to build that stuff okay but you probably didn't think to specify that and that's just like an infinite list of stuff like that that you would have to specify in order to capture all of the things which you actually want you're not going to be able to just like say them all sure so it seems like if we could somehow express don't mess up the world in a crazy way how much of like what you think of us the alignment problem or the value learning problem right so that gets us to one of the next problems which is so first of all we don't currently have a way that we know to robustly say don't mess up the rest of the world but i expect that that part is like relatively tractable like turner's stuff about average utility preservation i think would more or less work for that except for the part where again it can't really do anything that interesting the the problem with a machine which only does low impact things is that you will probably not be able to get it to do high impact things and we do sometimes want to do high impact things okay yeah can you describe a task where like here's this task that we would want an ai system to do and we need to do like really good value learning or alignment or something even beyond 
that well let's just go with your material example so we wanted to give us some material with very bizarre properties these properties are bizarre enough that uh you need nano systems to build them and it takes stupidly large amounts of energy so if you are asking an ai to build this stuff then by default it's going to be doing things to get stupidly large amounts of energy which is probably going to be pretty big and destructive if you ask a low-impact ai system to build this to like produce some of this stuff what it will do is mostly try to make as little of it as possible because it will have to do some big impact things to make large quantities of it yeah so it will mostly just like try to avoid making the stuff as much as possible yeah i mean i mean you could imagine we just like ramp up the impact budget like bit by bit until we get enough of the material we want and then we definitely stop well then the question is how much material do you want and what do you want it for like presumably what you want to do with this material is like go do things in the world like build new systems in the world that do cool things the problem is that is itself high impact you are changing the world in big ways when you go do cool things with this new material all right like if you're creating new industries off of it that's a huge impact and your low impact machine is going to be actively trying to prevent you from doing that because it is trying to have low impact cool so before you you've said that um you're interested in solving like ambitious value learning where we just understand like everything about humans preferences not just like don't mess things up too much so this is why right out the gate i'll clarify an important thing there which is i'm not like i'm perfectly happy with things other than ambitious value learning like that is not my exclusive target i think it is a useful thing to aim for because it sort of most directly forces you to address the hard problems but like there there's certainly other things we could hit on along the way that would be fine like if if it turns out that corridability is actually a thing that would be great what's corridability if it's a thing go look it up all right fine daniel you can tell people what courage ability is okay so something like getting an ai system to like be willing to be corrected by you and let let you like edit its desires and stuff yeah like it's it's trying to help you out without necessarily influencing you it is trying to like help you get what you want yeah it's it's trying to like be nice and helpful without like pushing you okay and so yeah the reason to work on ambitious value learning is just like there's nothing we're sweeping under the rug yeah part of the problem when you're trying for like corrigibility or something is that it will be it makes it a little too easy to like delude yourself into thinking you're solving the problem when in fact you're ignoring big hard things whereas with uh ambitious value learning like there's a reason ambitious is right there in the title it is very clear that there are a bunch of hard things and they're they're broadly the same hard things that everything else has it's just sort of more obvious all right and so how does your work relate to ambitious value learning is it something like uh to figure out what abstractions are than like something profit uh so it ties in in multiple places one of them is by and large humans care about high-level abstract objects not about low-level quantum 
fields so if you want to understand human values then like understanding those abstractions is like a clear useful step right similarly if we buy the natural abstraction hypothesis then that like really narrows down which things we might care about right so there's that angle uh another angle is just using abstraction as a foundation for talking about agency more broadly okay another value of what abstraction is for or another value of how we of course usually learning of what abstraction is okay cool so like understanding better understanding agency and what goals are in general and like how to look at a physical system and figure out what his goals are like that's abstraction is a useful building block with to use for building those models okay cool isn't type signature of human values world to reals what sloppy what person isn't the type signature of human values even if we're taking the pure you know bayesian expected utility maximizer viewpoint is an expected utility maximizer that does not mean that your inputs are worlds that means your inputs are human models of world like it is the random variables in your model that are the inputs to the utility function not whole worlds themselves you do not like the world was made of quantum fields long before human models had any quantum fields in them right those clearly cannot be the inputs to to a human value function well i mean an expected utility sorry uh in in the symbol expected utility world if if i have an ai system that i think has like better probabilities than i do then like uh if it has the same like world to reals function then like i'm happy to defer to it right what's a world to realize how does it have a world and it's in its uh in its world model like a world model the ontology of the world model is not the ontology of the world the variables in the world model do not necessarily correspond to anything in particular in the world well i mean they they kind of correspond to things by the natural abstraction hypothesis that would give you a reason to expect that some variables in the world model correspond to some things in the world but like that's not a thing you get for free just from like having expected utility maximizer all right so next i want to talk a little bit about how you interact with the research landscape okay so you're an independent researcher right i'm wondering like how yeah which parts of ai alignment do you think fit best with independent research and which parts like don't fit with it very well so the right now de facto answer is that like most of the substantive research in alm that fits best with independent research like academia is mostly going to steer you towards like doing sort of bulky things that are easier to publish and the there's like what three orgs in the space uh miri is kind of on ice at the moment redwood is doing like mildly interesting empirical things but not even attempting to tackle core problems uh anthropic is doing some mildly interesting things but not even attempting to tackle court problems like if you want to tackle the core problems then independent research is clearly the way to go right now that does not mean that like there's structural reasons for independent research to be the right sort of route for this sort of thing so like as as time goes forward sooner or later the field is going to become more paradigmatic and the more that happens the less independent research is going to be like structurally favored uh but even now there are like parts of the problem where you 
can do paradox more paradigmatic things and there working at an organization makes more sense all right and and so when you say like this this interest between core problems that maybe independent research is best and non-core problems that other people are working on what's that division in your head so the the core problems are things like uh sort of like the conceptual stuff we've been talking about like right now we don't even know what are the key questions to ask about alignment we still don't really understand what agency is mathematically or any of the like adjacent concepts within that cluster uh and as long as we don't know any of those things we don't really have any idea how to like robustly design an ai that will do a thing we want right all right we're at the stage where we don't know which questions we need to ask in in order to do that and when i talk about tackling the core problems i'm mostly talking about like research directions which will plausibly get us to the point where we know which questions we need to ask or which things we even need to do in order to robustly build an ai which does what we want all right so yeah one one thing here is that it seems like you've ended up with this outlook especially on the on the question of like what the core problems are that's relatively close to that of uh mary yep the machine intelligence research institute uh you don't work at mary my understanding is that you didn't like grow up there in some sense what is your relationship to that thought world so i've definitely had like i was exposed to the sequences back in back in college 10 years ago and i followed mary's work for for a number of years after that uh i went to the miri summer fellows program 2019 so i've had like a fair bit of exposure to it it's not like it's not like i was just evolved completely independently but it's still like even even correct like there's lots of other people who have followed mary to about the same extent that i have and have not converged to similar views to nearly the same extent right including plenty of merry employees even it's a very heterogeneous organization all right so what do you mean by mary views if mary's such a heterogeneous organization i mean i i think when people say mere reviews they're mostly talking about like uh yudkowski's views uh nate's views those those pretty heavily overlap a few other people the the agent foundations team is relatively similar to those uh yeah so a lot of people have followed mary's output and read the sequences but didn't end up with uh mary's fees including many people in mary yeah i assume that's because they're they're all defective somehow and do not understand things at all i don't know man i'm being sarcastic all right yeah we can't daniel daniel you can see that from my face oh but yeah uh i think i so part of this i think is about sort of how i came to the problem like i came through these sort of biology and economics problems which were actually let me back up talk about how other people come to the problem so i think a lot of people start out with like just just starting at the top like how do we align in ai how do we get an ai to do things that we want robustly right and if you're starting from there you have to like play down through a few layers of the game tree before you start to realize what the main bottlenecks to solving the problem are whereas i was i was coming from this this different direction from like biology and economics where i had already like gone a layer or two down 
those game trees and seen what those hard problems were so when i was looking at ai i was like ah this is clearly going to run into the the same hard bottlenecks right right it's it's mostly about like going down the game tree deep enough to see what those key bottlenecks are okay and you think that like uh yeah somehow people who didn't end up with this these views do you think they like went down a different leg of the tree or i think most of them just like haven't gone down that many layers of the tree yet like most people in this field are still pretty new in absolute terms and it does take time to like play down the game tree uh and i do think the longer people are in the field the more they tend to converge to a similar view so for instance example of that uh right now uh myself scott garabrant and paul cristiano are all working on basically the same problem all right it's it's we're all basically working on what is abstraction or like what where does the human ontology come from that sort of thing sure and that was very much a case of convergent evolution we all came from extremely different directions all right yeah speaking of these adjacent fields i'm wondering so you mentioned uh biology and economics are there any other ones that are like sources of inspiration uh so biology economics and air ml are the big three okay i know some people draw similar kinds of inspiration from neuroscience i personally haven't spent that much time looking into neuroscience but like i it's certainly an area where you can end up with similar intuitions and also complex systems theorists run into similar things to some extent okay so i expect that a lot of listeners to the show will be like pretty familiar with ai maybe less familiar with um biology or economics who are the people or what are the like lines of research that you think are really worth following there so i don't think there's no one that immediately jumps out as like the the person in economics or in biology who is like clearly asking the analogous questions to what we're doing here okay i would say yeah no i don't i don't have like a clear good person like there are people who do like good work in terms of understanding the systems but it's not really about like fundamentally getting what's going on with agency okay another okay i'm gonna wrap back a bit and another question about uh being an independent researcher in non-independent research there's like you know it's run by organizations and the organizations are thinking about creating like useful infrastructure to have their researchers do good stuff i imagine this might be deficit in the independent research landscape so firstly what existing infrastructure like do you think is pretty useful and what infrastructure like could potentially exist that would be super useful yeah so the obvious big ones currently are less wrong slash the alignment forum uh and the light cone offices and now actually also the constellation offices to some extent which are offices in berkeley where some people work yep so that's obviously hugely valuable infrastructure for me personally things that could exist the the big bottleneck right now is just like getting more independent researchers to the point where they can do useful independent research and alignment like there's a lot of people who'd like to do this but they don't really know what to do or where to start if i'm thinking more about like what would be useful for me personally there's definitely a lot of space to have like people focused on 
distillation like full-time figuring out what's going on with uh various researchers ideas and trying to write them up in in more easily communicated forms so a big part of uh my own success has been like being able to do that pretty well myself but even then it's still pretty time intensive and certainly there are other researchers who are not good at doing it themselves and it's helpful both for them and for me as someone reading their work to have somebody else come along and you know more clearly explain what's going on sure so going back a little bit to um upcoming researchers who like uh trying to figure out how to get to a place where they can do useful stuff yep concretely what do you think is needed there yeah that's that's something i've been working on a fair bit lately is trying to figure out what exactly is needed so there's a lot of currently not very legible skills that go into uh doing this sort of pre-paradigmatic research obviously the problem with them not being legible is like i can't necessarily give you a very good explanation of what they are right yeah yeah to some extent there are ways of like getting around that like if you're working closely with someone who has these skills then you can sort of pick them up i wrote a post recently arguing that a big reason to study physics is that physicists in particular seem to have a bunch of illegible epistemic skills which somehow get passed on to new physicists but like nobody really seems to make them legible along the way and then of course there's like people like me trying to figure out what some of these skills are and just directly make them more legible yeah so for instance i was working with my apprentice recently just like doing some math and i asked him at one point i paused and asked him uh all right sketch for me the picture in your head that corresponds to that math you just wrote down and he was like wait picture in my head what and i'm like ah that's an important skill we're gonna have to install that one like when you're doing some math you want to be act like have a picture in your head of like the prototypical example of the thing that the math is saying that's like a very like crucial load-bearing skill yeah so then like did a few exercises on that and like within a week or two there was a very noticeable improvement as a result of that right okay but that's an example of this sort of thing but sure it's it's not it's not very legible but like it's it's extremely important for being able to do the work well all right another question i'd like to ask is about the field of ai alignment i'm wondering at some point i believe you've said that you think that at some point in like five to ten years it's gonna kind of get its act together there's gonna be some kind of phase transition yeah where things get easier yeah can you talk a little bit about like why you think that will happen and what's going on the obvious the obvious starting point here is that i'm trying to make it happen so there's it's it's an interesting problem trying to like like set the foundations for a paradigm right for paradigmatic work it's an interesting problem because you have to kind of like play the game at two separate levels there there's a technical component where you have to have the right technical results in order to to to like support this kind of work but then at the same time you want to be pursuing the kinds of technical results which will provide legible ways for lots of other people to contribute so you want to kind of be 
optimizing for both of those things simultaneously. And this is, for instance, the selection theorems thing: despite the bad marketing, this was exactly what it was aiming for. And I've already seen some success with that. I'm currently mentoring an AI Safety Camp team which is working on the modularity thing, and it's going extremely well. They've been turning around real fast; they have a great iterative feedback loop going, where they have some idea of how to formulate modularity, or what sort of modularity they would expect to see and why, they go test it in an actual neural network, it inevitably fails, and the theorists go back to the drawing board. There's a reasonably clear idea of what they're aiming for and what success looks like, and they're able to work on it very quickly and effectively. That said, I don't think the selection theorems thing is actually going to be what makes this phase change happen longer-term, but it's sort of a stopgap, I guess.

So is the idea that the natural abstraction hypothesis gets solved, and then what happens?

That would be one path; there's more than one possible path here. But if the natural abstraction hypothesis is solved real well, you could imagine that we have a legible idea of what good abstractions are. Then we can tell people: go find some good abstractions for X, Y, Z, right? Agency, or optimization, or world models, or information, or what have you. If we have a legible notion of what abstractions are, then it's much more clear how to go looking for those things, what the criteria are for success, whether you've actually found the thing or not. And those are the key pieces you need in order for lots of people to go tackle the problem independently; those are the foundational things for a paradigm.

Cool. I guess wrapping up: if people listen to this podcast and they're interested in you and your work, how can they follow your writings and stuff?

They should go to LessWrong.

All right, and how do they find you?

Go to the search bar and type johnswentworth into the search bar. Alternatively, you can just look at the front page, and I will probably have a post on the front page at any given time.

All right, we'll say johnswentworth as the author. Cool. Well, thanks for joining me.

Thank you.

And to the listeners, I hope this was valuable. This episode is edited by Jack Garrett, and the opening and closing themes are also by Jack Garrett. The financial costs of making this episode are covered by a grant from the Long-Term Future Fund. To read the transcript of this episode, or to learn how to support the podcast, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

[Music]
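A rough way to see the point about chaos eating measurement precision, from the billiards discussion in the transcript: if each collision roughly doubles the error in your knowledge of a ball's state, then any fixed measurement precision is exhausted after a number of collisions that is only logarithmic in that precision. The sketch below is an editorial illustration in Python, not something from the episode; the nanometer starting precision and the one-metre table scale are made-up numbers.

```python
# Toy illustration of chaotic error growth: each collision roughly doubles the
# uncertainty, so any finite measurement precision runs out after ~log2 steps.
import math

initial_uncertainty_m = 1e-9   # assume we somehow know each position to a nanometer
table_scale_m = 1.0            # "maximally uncertain" ~ uncertainty spans the table

collisions = 0
uncertainty = initial_uncertainty_m
while uncertainty < table_scale_m:
    uncertainty *= 2           # one collision: error roughly doubles
    collisions += 1

print(f"{collisions} collisions until nanometer precision becomes table-scale uncertainty")
print(f"(log2 check: {math.log2(table_scale_m / initial_uncertainty_m):.1f})")
# Roughly 30 collisions, regardless of how the initial uncertainty arose.
```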
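The "information preserved at a distance" picture can be illustrated the same way: imagine a distant observer who only receives a noisy aggregate of many low-level variables. Conditional on that aggregate, the remaining low-level details never reach the observer at all. This is a toy stand-in rather than Wentworth's actual formalism; the choice of the sum as the surviving summary, the noise level, and the variable names are all assumptions made for the sketch.

```python
# Toy model: far away, only a summary statistic of the low-level state survives.
import numpy as np

rng = np.random.default_rng(0)

def observe_from_far_away(low_level_state, n_samples=200_000, noise_scale=5.0):
    """A distant, noisy view of a patch of the world: the observer only
    receives the total of the low-level variables, buried in noise."""
    total = low_level_state.sum()   # the only information that propagates
    return total + noise_scale * rng.standard_normal(n_samples)

# Two micro-states with very different details but the same aggregate,
# plus one with a different aggregate:
state_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # sum = 15
state_b = np.array([3.0, 3.0, 3.0, 3.0, 3.0])   # sum = 15
state_c = np.array([0.0, 0.0, 0.0, 0.0, 0.0])   # sum = 0

obs_a, obs_b, obs_c = (observe_from_far_away(s) for s in (state_a, state_b, state_c))

# a and b are statistically indistinguishable from far away; c is easy to tell apart.
# Given the summary (the sum), the fine-grained details are irrelevant at a distance.
print(obs_a.mean(), obs_b.mean(), obs_c.mean())   # roughly 15, 15, 0
print(obs_a.std(), obs_b.std(), obs_c.std())      # all roughly 5
```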
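And the sufficient-statistic property behind the Koopman-Pitman-Darmois discussion can be checked directly for a simple exponential-family model. The sketch below uses a Gaussian with known variance, for which the sample size and sample sum are sufficient for the mean; the datasets are invented for illustration, and nothing here touches the generalized, causal-graph version of the theorem discussed in the episode.

```python
# For an exponential-family model, the data enter the likelihood only through a
# fixed-dimensional summary. Here: i.i.d. N(mu, 1) measurements, with
# sufficient statistic (n, sum of x) for the mean mu.
import numpy as np

def gaussian_log_lik(mu, data, sigma=1.0):
    """Log-likelihood of i.i.d. N(mu, sigma^2) measurements."""
    n = len(data)
    return (-0.5 * np.sum((data - mu) ** 2) / sigma**2
            - n * np.log(sigma * np.sqrt(2.0 * np.pi)))

# Two datasets of the same size with the same sufficient statistic for mu
# (the sample sum), but different fine-grained details:
data_1 = np.array([0.0, 2.0, 4.0])   # n = 3, sum = 6
data_2 = np.array([1.0, 2.0, 3.0])   # n = 3, sum = 6

# Their log-likelihood curves differ only by a constant that does not depend on
# mu, so they support exactly the same inferences about mu (same MLE, same
# posterior under any prior): the summary carries everything relevant.
for mu in np.linspace(-1.0, 4.0, 6):
    diff = gaussian_log_lik(mu, data_1) - gaussian_log_lik(mu, data_2)
    print(round(float(mu), 2), round(float(diff), 6))   # the same constant every time
```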

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
