AXRP · Civilisational risk and strategy

Infra-Bayesian Physicalism with Vanessa Kosoy

Why this matters

Auto-discovered candidate. Editorial positioning to be finalized.

Summary

Auto-discovered from AXRP. Editorial summary pending review.

Perspective map

Mixed · Governance · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
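As a rough illustration of the tinting rule described above, the sketch below maps a slice score onto an amber → cyan → white strip by linear interpolation. The score range of -100 to +100 and the specific RGB endpoints are assumptions made for this example only; the real palette and scale are not specified here.

```python
# Illustrative sketch (not the site's actual rendering code): tint a slice
# score along the amber -> cyan -> white strip described above.

AMBER = (255, 191, 0)    # risk-forward end of the strip (assumed RGB)
CYAN = (0, 183, 194)     # strip midpoint (assumed RGB)
WHITE = (255, 255, 255)  # opportunity-forward end

def lerp(a, b, t):
    """Linearly interpolate between two RGB colours, t in [0, 1]."""
    return tuple(round(x + (y - x) * t) for x, y in zip(a, b))

def tint_for_score(score, lo=-100, hi=100):
    """Map a slice score (assumed range lo..hi) onto the colour strip."""
    t = (min(max(score, lo), hi) - lo) / (hi - lo)  # normalise to [0, 1]
    if t < 0.5:
        return lerp(AMBER, CYAN, t / 0.5)            # risk-forward half
    return lerp(CYAN, WHITE, (t - 0.5) / 0.5)        # opportunity-forward half

print(tint_for_score(-80))  # strongly risk-forward slice -> near amber
print(tint_for_score(0))    # mid-strip slice -> cyan-ish tint
```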


Across 85 full-transcript segments: median 0 · mean -2 · spread -180 (p10–p90 -100) · 1% risk-forward, 99% mixed, 0% opportunity-forward slices.

Slice bands
85 slices · p10–p90 -100

Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes safety
  • Emphasizes AI safety
  • Full transcript scored in 85 sequential slices (median slice 0); a sketch of how these summary figures could be derived follows the list.
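The headline figures above (median, mean, p10–p90, band percentages) read as simple descriptive statistics over the 85 per-slice scores. The sketch below shows one way such figures could be computed; the band cut-offs and the example scores are assumptions for illustration, not the site's actual scoring pipeline.

```python
# Illustrative sketch only: deriving slice summary statistics from a list of
# per-slice spectrum scores. Cut-off values are assumed for the example.

from statistics import mean, median, quantiles

def summarise_slices(scores, risk_cutoff=-50, opp_cutoff=50):
    """Summarise per-slice scores into the figures shown in the slice bands."""
    q = quantiles(scores, n=10)          # decile cut points
    p10, p90 = q[0], q[-1]
    bands = {
        "risk-forward": sum(s <= risk_cutoff for s in scores),
        "opportunity-forward": sum(s >= opp_cutoff for s in scores),
    }
    bands["mixed"] = len(scores) - sum(bands.values())
    return {
        "slices": len(scores),
        "median": median(scores),
        "mean": round(mean(scores), 1),
        "p10_p90": (p10, p90),
        "band_share": {k: round(v / len(scores), 2) for k, v in bands.items()},
    }

# Example with made-up scores (the real episode uses 85 slices):
print(summarise_slices([-80, -10, 0, 0, 5, 12, -3, 60]))
```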

Editor note

Auto-ingested from daily feed check. Review for editorial curation.

ai-safety · axrp

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video XyXn7Hj2oYc · stored Apr 2, 2026 · 2,993 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/infra-bayesian-physicalism-with-vanessa-kosoy.json when you have a listen-based summary.

Full transcript
[Music] hello everybody today i'm going to be talking with vanessa kassoy she is a research associate of the machine intelligence research institute and she's worked for over 15 years in software engineering about seven years ago she started ai alignment research and is now doing that full time back in episode five she was on the show to talk about a sequence of posts introducing informationism but today we're going to be talking about her recent post information physicalism a formal theory of naturalized induction co-authored with alex of hell for links to what we're discussing you can check the description of this episode and you can read the transcript at axrp.net vanessa welcome to excerpt uh thank you for uh inviting me cool so this episode is i guess about uh inforbation physicalism can you remind us of the basics of just what informationism is yes so informationism is a theory we came up with to solve the problem of non-realizability which is how to do theoretical analysis of reinforcement learning algorithms in situations where you cannot assume that the environment is in your hypothesis class which is something that has not been studied much in the literature for reinforcement learning specifically and the way we approach this is by bringing in concepts from so-called imprecise probability theory which is something that's mostly decision theories that economists has been using and the basic idea is instead of thinking of a probability distribution you could be working with a convex set of probability distributions that's what's called the cradle set in imprecise probability theory and then when you're making decisions instead of just maximizing the expected value of your utility function with respect to some probability distribution you're maximizing the minimal expected value where you minimize over the set so that's like as if you imagine an adversary is selecting some distribution out of the set the nice thing about it is that you can start with this basic idea and on the one hand construct an entire theory analogous to classical probability theory and then you can also use this to construct the theory of reinforcement learning and generalize various concepts like regret bonds that exist in classical reinforcement learning or like markov decision processes you can generalize all of those concepts to this imprecise setting cool so my understanding of the contributions at least to what informationism is that you made was basically that it was a way of combining imprecise probability with an update rule with like sequential decision me making in some kind of coherent way so like today what i would want to do tomorrow if i learned that the sun was shining once i wake up tomorrow if the sun is actually shining i still want to do that thing is that like roughly a good way of explaining what the contribution is at least in those original posts well there are several aspects there like what aspect is the update rule here i have to admit that an equivalent object rule has already been considered in a paper by gilboa and schmidler but they described it as partial orders over policies they did not describe it in the mathematical language which we used which was of those convex sets of so-called a measures and the duo form where we use concave functionals on the space of functions they did not have that so we did contribute something to the other true and the other thing is combining all of this with concepts from reinforcement learning such as regret bonds and seeing that all of this 
makes sense and the mark of decision processes and so on and the first thing was applying this to newcombian kind of newcomer products type situations to show that we are actually getting behavior that's more or less equivalent to so-called functional decision theory cool one kind of basic question i have about this so last time we talked about this idea of like you know every day you wake up and like there's some bit you get to see it's zero or one right and you mentioned that like one thing you can do with informationism is you can have like some hypothesis about even bits but say like you don't know anything about the distribution of the odd bits or like how they relate to each other and i was wondering if you could give us some sense of like we have this like convex probability distribution you know maybe over the odd bits what would we lose if we just averaged over that convex set like what kind of behavior can we get with informationism that we couldn't get normally well there are several problems with this like one one thing is that there's a technical problem that if your space is infinite dimensional then it's very unclear what does it even mean to average over the convex set but that's more of a technicality the real problem is that any kind of average brings you basically back into the bayesian setting and in the bayesian setting you only have guarantees for environments that match something in your prior so yeah so only if you're like absolutely continuously respected or prior then you can have some kind of guarantees or sometimes you can have some guarantees under some kind of an ergodicity assumption but in general it's very hard to guarantee anything if like you know if you you're just have some arbitrary environment here we can have some guarantee like it's enough to for the environment to be somewhere inside the convex set that we can guarantee that the agent gets at least that much expected utility okay so putting that concretely just to see if i understand one sort of bayesian thing i could do in the evening bits scenario is i could say that like on all the odd bits i have like i'm 50 50 on whether the bit will be zero one and it's independent and identically distributed on every odd day whereas like i could also have some sort of information mixture or you know the set of like distributions that could happen um on the odd days and it seems like in terms of absolute continuity one thing that could go wrong is it could be the case that actually like on the odd days one-thirds of the bits are one and two-thirds of the bits are zero and that wouldn't be absolutely continuous with respect to my bayesian prior but it might still be in the convex set in the information setting am i understanding that correctly is that like a good example yeah like if your prior just does not have like you can assume that the things you don't know are distributed uniform but in reality they are not destroyed uniform they are distributed in some other way and then you're just not going to like what happens is that uh suppose you have set several hypotheses about what what is going on with the even bits and one of them is correct and with the odd bits you don't have any hypothesis that matches what is correct then what happens with the bayesian update is that you don't even converge to the correct things on the even bits the fact that the odd bits are behaving in some way which is not captured by your prior causes you to fail to converge to correct beliefs about the even bits even though the even 
bits are behaving in some regular way all right cool with that clarification i've actually asked some people what they'd like me to ask in this interview and i think a lot of people have the experience of maybe they're putting off reading these posts because they seem a bit mathy and they're not sure like maybe what they would get out of reading them i was wondering if you could tit your own horn a little bit and say like what what kind of insights might there be or what can you say that might tempt people to really delve in and read about this well i think that information is is important because it is at the center of a lot of important conceptual questions about ai and about er alignment because in a lot of cases like the inability to understand how to deal with non-realizability it kind of causes challenges to negative attempts to come up with models of things and neucombian paradoxes is one example of that because like eucombian paradoxes are inherently non-realizable because you have this omega that simulates you so you cannot be simulating omega sure and just uh in case people haven't heard of that by neukomian paradoxes i guess you mean like situations where like you're making a decision but some really smart agent has like figured out what you would do ahead of time and has changed your environments in ways that are maybe unknown to you by the time you're making the decision is that right yeah so nucabian parallaxes means that your environment contains predictors which are able to predict your behavior either ahead of time or in some other counterfactual and do things that depend on death and that's one type of situation where you have to somehow deal with non-realizability and you know and things like just averaging over the convex set it just doesn't work another type of situation which is kind of related to that is multi-agent scenarios where you have several agents trying to make predictions about each other and that's also something where it's not possible for all agents to be realizable with respect to each other because that creates a sort of flu paradox and this of course has also implications for when you're thinking about agents that are trying to kind of self-modify or agents that are trying to construct smarter agents than themselves all sorts of things like this so like a lot of different questions you can ask kind of run up against problems that are caused by non-realizability and you need to have some tool some some framework to deal with this okay cool moving on a bit i suppose um it claims to be some kind of theory of naturalized induction can you first tell me like what is naturalized induction yes so naturalized induction is a name invented by miriam farzana to the problem of how do you come up with a framework for ai in which you don't have this cartesian boundary between the eye and its environment because like classical models or you can call it like the classical cybernetic model if you use ai as you have an agent and you have an environment and they're completely separate and the agent can take actions which influence the environment and the agent can see observations in which sense like the environment has some communication in the other direction but there is some clear boundary between them which is immutable and this does not take into account a bunch of things things like the fact that the agent was created at some point and something existed before that things like that the agent might be destroyed at some point and something different will exist 
afterwards or the agent can be modified which you know you can use a special case of getting destroyed but that's still something that you need to kind of contend with and it also brings in the problem with the need to have those so-called breach rules so it's like because you're working in this cybernetic framework the hypothesis your agent constructs about the world are they have to be phrased from the subjective point of view of the agent so they cannot be phrased in terms of some kind of bird's eye view or whatever and that means that the laws of physics are not available for to have a valid hypothesis you need to have something like the laws of physics plus rules how to translate between the degrees of freedom that the laws of physics are described within to the actual observations of the agent and that like makes the hypothesis much much more complex which creates a whole bunch of problems and why does it make it more complex so think about it is like we have a cams racer right which is telling us that we should look for simple explanations simple regularities in the world and of course against razor is kind of the foundation of science and rational reasoning in general but if you look at how physicists use occam's razor physicists they say okay here this is like a quantum cell theory of screens here or whatever this is this really simple and elegant theory that explains everything but the theory doesn't explain our observations directly right the theory is not phrased in terms of like what my eyes are going to see when i move my head like two degrees to the right or something they're kind of described in terms of some fundamental degrees of freedom quantum fields particles or whatever and the translation between quantum fields and what my eyes going to see where the camera of the eyes going to see or whatever that's something extremely complex and like it's not a simple and elegant equation that's something monstrous and people kind of have been trying to find frameworks to solve this well at least people have been a little bit trying so like the classical framework for ai in the cybernetic from the like cybernetic cartesian point of view is the so-called axon so there have been some attempts for example orson rink had a paper called something like space-time validation where they kind of try to extend this in some way which would account for those problems but that did not really take off like there are problems with their solution and they haven't really fall out much in that this is the process of the problem i'm trying to solve it okay can you give us a bit of a sense of why we should be excited for a solution to naturalized agency like can we just sort of approximate it as some cartesian thing and not worry too much well like i said the fact that we need to add a whole bunch of complexity to our hypothesis in fact if we need to add those bridge rules it creates a range of problems one problem is first of all this mere fact should makes us suspicious that something has gone deeply wrong because if our starting point was ocam's razor like that you know we're assuming that simple hypothesis should be correct and here the diabolos is not simple at all so it's like why is our simplicity bias even going to work and when we start analyzing it we see that yeah we run into problems because what you can do for experiments like what happens if you have an ai and someone throws a towel on the camera so at this particular moment of time all the pixels on the camera are black because the the 
towel is blocking all the line now the i can think and decide hmm what is the correct hypothesis is you know this hypothesis like i had this hypothesis before but maybe the correct hypothesis is actually if as long as not all the pixels are black this is what happens like or zero physics and at the moment all the pixels go black so something changes i don't know like defined structure constant becomes different yeah it could even become a simpler number yeah and from the perspective of like scientific reasoning this sounds completely crazy hypothesis which should be discarded immediately but from the perspective of a cartesian ei system this sounds very reasonable because it only increases the complexity by a very small amount because like this event of all the pixels becoming black it's a natural event from a subjective point of view it's like an event which very low description description complexity but from a point of view of physics it's a very complex event like from the point of view of physics when we're taking this kind of bird's eye view we're saying well there's nothing special about this camera and about this towel like the fact that some camera somewhere has black pixels that should not affect the fundamental laws of physics like that would be an extremely contrived modification of the laws of physics if that happens so we should assign like extremely low probability to this but the the cybernetic agent does not assign low probability to this so something is very weird about the way it's reasoning and this is one example another example is thinking about evolution you know like the theory of evolution explains why we exist or like you know we have theories about cosmology which are also part of the explanation of why we exist and those theories are kind of important in the way we reason about the world and they help us understand all sorts of things about the world but for the cartesian agent those questions don't make any sense because the cartesian agent defines everything in terms of its subjective experiences so things that happened before the agent existed or things that led to the agent to start existing that's just nonsense it's just not defined at all in the cartesian framework so those kinds of agencies are just unable kind of ontologically incapable of reasoning along those lines at all and that seems like an indication that something is going wrong with them okay and i was wondering so this problem of bridge rules first can you give us a sense like informally without going into the details roughly like how is information physicalism going to deal with this yes so the way we are going to deal with this is saying that we don't want our hypothesis to be describing the world from this subjective point of view from the subjective perspective of the agent we want our hypothesis to be describing the world from some kind of quote-unquote bird's-eye view and then the whole problem becomes okay but how do you deal with the translation between the bird's eye view and the subjective because the evidence that the agent actually has is the evidence of its senses so it's evidence that exists in the subjective frame and here the key idea is that we find some mathematical way to make sense of the agent needs to look for itself inside the universe so the agent has a hypothesis that kind of describes the universe from from this bird's eye view and then it says i'm an agent with such-and-such source code that i know and now i look into this universe and i'm searching for where inside 
this universe the source code is running and if the source code is running in a particular place and receiving particular inputs then those inputs are what i expect to see if i find myself in this universe and this is how i measure my evidence against this kind of bird's-eye view of my bodices okay cool i also just want to go over so there are some things that i guess i associate with the problem of naturalized agency and i want to check like which ones of these do you think like information physicalism is going to help me with it's the first thing on my list was world models that include the self and it seems like yeah this is naturalized information physicalism is going to deal with this yeah that's true so your world model is kind of describing the world there's no longer a cartesian boundary anywhere in there so the world definitely contains you or like you know if the world does not contain you then the agent is going to say okay this is a world where i don't exist so i can safely ignore it more or less okay the next one is sometimes people talk about logical uncertainty under this umbrella of naturalized induction are we going to understand that better so the way i look at it is that logical uncertainty is addressed to some extent by informationism and then it's addressed even better if you're doing this thing which i've been calling during reinforcement learning where you take an information agent and say okay you also have some computer that you can play with that has a bunch of computational resources and this is actually the starting point for physicals so when we go from this thing to physicalism then i don't think it adds anything about logical uncertainty equal logical uncertainty but more about how do we use this notion of logical uncertainty that we're getting from informationism or like informationism plus this touring reinforcement learning thing how do we use this notion of logical uncertainty and apply it to solve problems with naturalized induction so it's definitely related okay next thing on my list is open source game theory so that's this idea that like um i might be playing a game but like you have my source code so i don't just have to worry about like my actions i have to worry about like what you will reason about my actions because you have my source code uh is information physicalism gonna help me understand that so i don't think that's a particularly like when i was thinking about this kind of scenarios i'm well i don't really have a good solution yet but i'm imagining that some kind of solution can come from just looking at some more classical kind of cartesian setting it doesn't feel like the physicalism part is really necessary here because you can just kind of imagine having several cartesian agents that are kind of interacting with each other and that's probably gonna be fine so i'm not sure that the physicalism part actually adds something important to thinking about this yeah it actually makes me wonder so just from the informal thing you said there's this idea that like i have this world model and i look in it for my program being executed somewhere in the open source game theory setting it seems like my program is being executed in two places maybe it's not being executed but it's being reasoned about somehow like is there a way of picking that up yeah so there is certainly some relation in this regard but i'm still not sure that it's very important because like in some sense you already get that in information right like with information is we can 
deal with those kind of new combined situations where like outside world contains predictors of me and that's kind of a very similar type of thing so it feels like going to physical is mostly buys you something which kind of looks like having it's not exactly having a better prior but it's kind of similar to having a better prior more than like you know if if you can kind of formalize your situation by like i already have a good enough prior that let's just assume i have this prior which contains all the hypotheses i need and go from there again it doesn't seem like you really need physicalism at this stage and it seems like at least for the initial investigation understanding open source you know open source game theory should be possible with this kind of approach without invoking physicalism as far as i can tell i mean you do want to to do physicalism if you are considering more specific examples of it like a causal bargaining you know you're thinking of a causal bargaining with agents that exist in some other universes or weird things like that sorry what is a causal bargaining so a causal bargaining is the idea that if you have two agents that do not have any causal access to each other so they might be just really far from each other or they might be literally in different universes they don't have any physical communication channels with each other they might still do some kind of cooperation because those engines can kind of imagine the other universe and imagine the agent in the other universe and by kind of having this sort of thinking where each of them can kind of reason abstractly about the other they might strike deals for example i like apples but there's another agent which likes bananas but my universe in which i live is not super good for growing apples but it's really good for growing bananas but the other universe the ancient dislikes apples believes you know the other universe it has like the opposite property so we could both benefit if we start growing the fruit that the other agent likes in our own universe you know assuming that our utility function kind of considers that having bananas or apples in a different universe to be an actual game so so this is like an example of a causal bargaining and when you're stacking so this is in some sense a special case of open source game theory but here it becomes kind of more important to understand you know how do you even reason about how does agents reason about this how did he know that this other agent exists what kind of weight they assigned to those other universes or whatever and when you go to this kind of questions then physicalism definitely becomes important okay so the next thing on my list is uh logical counterfactuals so sometimes people say that it's like important to be able to do counterfactuals where you're like okay this like logical statement that i've proved already what if it were false what if like i've reasoned about this program and i know that it would do this thing but what if it did this other thing instead is information and physicalism going to help me with that yeah so what's happening here is and in here my answer would be similar to the question about logical uncertainty that in some sense informationism already gives you some notion of logical contrafactuals and this is what we're using here like we're using this notion of logical counterfactuals there is some nuance here because to get physicalism work correctly we need to be careful about how we define these control fractions and this is 
something we talked about in the article but like the core like mathematical technique here is just using this fact and information because like the basic idea is that in information is you can have this kind of nike uncertainty and what we're doing here is saying like assume you have guidance uncertainty about the outcome of a computation then you can build counterfactuals by forming those sort of logical conjunctions which you know you can like information is you can kind of define conjunctions of beliefs so once you have like not uncertainty about however this computation is going to be zero or one you can take the conjunction with you know assume this computation is zero and see what came out comes out from it and in some sense you can say that information agents the way they make decisions is by the same kind of contractual it's by kind of forming the conjunction with like what if i take this action okay cool and nigerian uncertainty uh am i right that that's just like when you don't know what probability distribution you should assign so maybe you use a contact set or something yeah yeah not uncertainty it's like when you instead of having the single probability you have like in your convict set you have probability distributions which assign different probabilities to this event okay and the last thing i wanted to ask about is uh self-improvement so sometimes people say like uh yeah this problem of what would happen if i modified myself um to be smarter or something people sometimes want this under the umbrella of naturalized induction will information physicalism help me with this yeah so i mean here also i feel like the initial investigation of this question at the very least can be done just with turing reinforcement sorry because like you know if you're already able to you have no realizability and you can kind of reason about computations then you can also reason about what happens if i like build a new agent and this agent might be smarter than me and so far this sort of questions are kind of already covered by informationism plus steering reinforcement learning in some sense the physicalism part helps you when you get into trouble with questions you have to do with the prior you know questions like entropic reasoning or causal interactions between agents from different universes and of course like the breachers the bridge rules themselves and so on and so forth questions where you can kind of imagine some kind of a simplistic uh synthetic toy model environment where you're doing things which is kind of in kind of abstract away dealing with actual physics and you know anything related to actual physics those are questions for which you don't really need physicalism at least on the level of solving them for for some basic toy setting okay great now that we've heard what information physicalism is roughly supposed to do let's move on a bit to how it works so the first thing i want to talk about is the world models used in inforbation physicalism can you give me some yeah more technical detail on like what's the structure of these world models that we're going to be using yeah so the world model is a belief about we're considering joint beliefs about physics and about computations so what what does that mean so physics is just you know there is some physical universe so there is some kind of a state space where you know we can think of it as a timeless state space like what are all the possible histories what are like all the possible ways the universe can be across time and 
space and whatever so there is some space which we don't find which is like the space of all things the universe can be and this space it can depend on the hypothesis so it can be anything like each hypothesis is allowed to to have to postulate its own space of of physical states and then the other part is the computations so that's what we denote by gamma and this gamma is mappings from programs to outputs so we can imagine something like you know the set of all turing machines or like the set of all programs for some universal turing machine which we assume to be um you know let's assume that every turing machine outputs a bit like zero or one so mappings from the set to to the set zero one we can think of it as like the set of computational universes like the set of ways that the universe can be in terms of like just abstract computations and not physics okay so this would include something like there's some universe where like when i run a program to check if the riemann hypothesis is true right in some like computational universe that program outputs like yes it is true in some computational universe it outputs no it's not true so uh how i should sort of understand it yeah this is about fraud except that of course in the specific case of the riemann hypothesis you don't really have you know we don't have a single turing machine that you can run and it will tell you this you can only check it for for like every finite approximation or something okay i actually wanted to ask about that because um in the real world there are some programs that like don't output anything they just like continue running and i don't necessarily know which programs those are so do i like do i also get to have uncertainty over which programs do you actually output anything yeah you don't necessarily know and in fact you necessarily don't know like because of the holiday pro right but yeah this is a good question and we initially thought about this but then at some stage i just realized that you don't actually need to worry about this because like you can just have your computation universe to assign an output to every program and some programs in reality are not going to hold and not going to produce any output but you just don't care about this so like like if a program doesn't hold it just means that you're never gonna have empirical evidence about whether it's output is zero or one so you're always gonna like remain uncertain about this but it doesn't matter really like you can just imagine that it has some output and in reality it doesn't have any output but since you don't know what the output is anyway it doesn't really bother you okay so it's sort of like our models are containing this like extra facts about these programs like we're imagining like yeah maybe once it runs for infinite time is it gonna output zero one like it seems like i'm entertaining the possibility okay so we have these models that's the computational universe and it's also the state space right yeah and then like your hypothesis is the joint belief about both of them so it means that it's some infant distribution over the product phi times gamma and we use some specific type of infrared distributions in the post which we called well actually for technical reasons we we call them ultra distributions in this booth but this is just an equivalent way of looking at the same thing so just instead of infinite distributions and utility functions we use ultra distributions and loss functions but that's just like a different notation which 
happens to be more convenient okay and in for distributions are just these convex sets right of distributions well yeah like what we call crisp infra distribution is just the convex set of distribution a general infrared distribution is something slightly more general than that like it can be a complex set of those so-called a measures or it can be described as a concave functional on the over the space of functions so it's something somewhat more general and specifically in this post we consider so-called homogeneous ultra distributions which is like a specific type and this is something that you can think about as like instead of having convex set over descript of distributions you have a complex set of what we call contributions which are measures that can sum up to less than one so like the total mass is less or equal to one and the set has to be closed under taking a smaller contribution like if you have a contribution you take one which is just lower everywhere then it also has to be in the step and and like those kinds of objects we call them homogeneous ultra ultra distributions and like your usual kind of vanilla set of distribution convex set of interdistributions is a special case of this okay he said we had this uh set of ultra distribution over phi times gamma gamma was the computational universe and what's phi and phi is the space of states of the physical universe states or histories maybe even more kind of timeless states or histories okay so that's the model we're using can you give me a sense of like so it seems like we're sort of putting this prior over you know over what computations might have what outputs do you have a sense of like what kind of prior i might want to use and whether i might want it to be entangled with the physical world because like it seems like that's allowed yeah you you absolutely super wanted to be in dangle because otherwise it's it's just not going to work like the sort of prior that you want to use here is a simplicity product because like you know that's something the whole point is having occam's razor so we haven't actually defined explicitly simplicity prior in the setting in the article but it's not difficult to imagine ways in which you could define it for example there's something that i have in an old positive post of mine where you can construct this kind of prior over you know this kind of convict says by taking like the solo model fire construction and instead of turing machines taking oracle machines and then like the oracle is kind of the source of nineteen uncertainty so it's not hard to to construct something analogous to the solomon prior for this type of hypothesis and this is what you should be using in some sense all right yeah when you say they should be entangled uh i guess that's because like a universe where like i physically type in some program and my physical actual computer outputs something like that should tell me about the computational universe right yeah like the fact the formulas works is like the entire comment is what tells you which computations are actually running so it's like okay yeah exactly like if you're running some computation on on your computer and you don't know what the result is going to be then you have some uncertainty about the computation and you have some uncertainty about the number which will appear on the screen which is like some property of the physical universe and and the two are intentional together right you know that whatever number appears on the screen is is the actual output 
of this computation okay and how much work do i have to do to specify that entanglement like does that come in some sense for free or when i'm like defining my models do i need to like carefully say like which like physical situations count as which computer programs well you know you start with a prior which is it by default it's like a non-informative product right so it's something like someone of course said anything good like maybe we have this entanglement maybe we have that entanglement maybe we have no entanglement maybe we have whatever like the the important thing is that the agent is able to use the empirical evidence it has in order to start narrowing things down or like updating away from this product in some sense like updating away in some sense because the formalism is actually updates it just like decides on some policy in advance but but like in practice like the things you do when you see particular things like a particular branch of on the tree of things that could happen would be dictated by some you know kind of subset of hypothesis which seemed more likely on on the street in some sense okay and i guess that's how i'm like learning about like which computations are being implemented if i'm thinking of the updateful version and you mentioned that we were going to have counterfactuals earlier can you say a little bit more about like what exactly counterfactuals about computations are going to look like here well the the basic idea is it's kind of simple like the basic idea is uh you know if there's a particular computation then we can consider you know we have some belief about computations or belief about universe times computation and then there's a particular computation and we want to consider the counter factual that its output is zero then all we need to do is take a subset so we had this convex set of of distributions or more precisely convex set of contributions and now all we need to do is take the subset of the contributions which assign probability zero to the other thing which is like not supposed to happen and that's that's our counterfactual and like that's the the basic building block here cool all right so yeah when we're talking a bit more specifically about that how the self is fitting in is the idea that like i've got this physical world model and like somehow i'm looking for parts which give evidence about the results of my computation or or yeah what's actually going on when i'm like trying to locate myself in a world model in this setting yeah so the key mathematical object here is something that we call the bridge transform that's actually what enables you to kind of locate yourself inside this this hypothesis or inside the universe described by tessai bodices and the idea here is that the bridge transform is a formal way of looking at this hypothesis and exploiting the entanglement between the computational part and the physical part in order to say which computations are actually running in the in the physical universe and which computations are not right so more precisely the more precise way to think about it is there are some facts about the computational universe that the physics knows that the physics kind of encodes or knows or whatever you like to call it and you can describe this as some subset of gamma like some element of two to the gamma or subset of gamma like the subset of gamma of like things that you know the universe knows that the computation the physical universe knows that the computational universe is somewhere inside the 
subset and then what the bridge transform does it starts with our hypothesis about v times gamma and transforms it to a hypothesis about phi times two to the gamma times gamma okay and how does it do that yeah so [Music] the idea here is this connects again to the question of contrafactuals because in some sense what does it mean for the universe to be writing a particular computation the answer we give here is phrased in informal terms is the universe runs a computation when the two counterfactuals corresponding or like the counterfactuals corresponding to the different outputs of this computation look differently in the physical world so if i consider the country factual in which a certain a program outputs zero versus the computation in which a certain program outputs one and i look how the physical universe looks like in those two counterfactuals if the universe looks like the same then it means that it's not actually writing this program or at least we cannot be sure that it's running this program whereas if they look completely differently like like two distributions with disjoint supports for example then we're assured that the universe is running this problem and there can also be an intermediate situation which you have like two distributions and they're kind of overlapping and then the size of the overlap determines the probability with which the universe is running this program okay so that's the bridge transformation and so basically we're saying like does the universe look different depending on how we say what the output of this program is and it tells us like which programs the universe is running how does that help us locate the self yeah the way it helps us locate the self is by using your own source code like if you're an agent and you know your own source code and you know we know that at least in like computer science world that's not hard to know your source code because you can always use quieting to kind of access your own source code in some sense then you can ask okay my source code so what does it mean your source code your source code is it's like a program that gets the history of your past observations in actions as input and produces a new action as output so this thing if the universe is running it with a particular input then it means that i exist in the universe and i observe this input right so and like conversely if i'm an agent and i know that i have seen a certain input then this allows me to say that okay this is information about a true hypothesis we know that the tribalism has the property that the universe is running the program which is me with this input okay so this basically means that the equivalent of bridge rules is just that like i'm checking if like my hypothesis about the universe wouldn't look any different depending on what i am or like what actions i would produce in response to given observations um is that roughly right yeah it means that like if suppose that you know like i'm looking at a hypothesis and i want to know that is this hypothesis predicting that i'm going to see a red room so i'm thinking okay suppose there's the program which is me which is getting an input which is the red room and it has an output which says should i lift my right hand or should i lift my my left hand and now i consider two counterfactuals the control faction in which upon seeing the red room i decide to leave my left hand in the counter factual in which upon seeing the red room i decide to lift my run here and those are kind of two computational 
counterfactuals because i'm just thinking about us in terms of computation there's the computation which is me receiving the input which is the red room and producing the output which is which arm to live so if in those two control factorials their bodice says that the physical universe looks different then this is equivalent to saying this hypothesis is actually predicting that i'm going to see a red room because there's a process says that the universe is running the program which is me with an input which is a red row okay so this is reminding me a bit of uh anthropics in particular in anthropics people sometimes wonder like how you should reason about different universes that might have more fewer copies of yourself so like if i learned that like there's one universe where there's like only one person just like me versus another universe where there are 10 people like me some people think like i should consider those just as likely where some people think like well the one where they're 10 times as many knees i should think that i'm like 10 times as likely to be in that one it sounds like like if you're just considering does the universe look different depending on the outputs of my actions it sounds like uh you're sort of equally waiting universes in which there's like just one of me versus universes where there's ten of me or like it seems like it's hard to distinguish those is that fair to say yeah it's absolutely fair so definitely the theory of antropics that ib physicalist agents have is a theory in which the number of copies it doesn't matter it doesn't matter if the universe is running one copy of you with a certain input or 80 copies that's not even a well-defined thing and it's not so surprising that it's not a well-defined thing because if you if you believe that it should be a well-defined thing then you quickly run into philosophical conundrums like okay if i'm just using a computer with thicker wires or whatever to run the same computation does it count as having more copies of the ai or whatever and the physical the answer is there's no such thing as number of copies there's like either i exist or i don't exist or maybe the hypothesis things exist with some probabilities but i mean there might be different copies in the sense that different branches right like there is me now observing something and then there are different things i can observe in the future and there are hypothesis which are going to predict that you will observe both like i'm i'm about to enter a room and the room is going to be either green or red and so my bodices are going to say well the universe is running you with with both inputs both with the input red room and the and the input green room so like in the sense there are two copies of you in the universe seeing different things but each branch like given a particular history of observations there's no notion of number of copies that see this history okay so you mentioned that like you should be suspicious of the number of copies of things because this argument involving computers with thick wires can you spell that out why do why do thick wires pose a problem for this view yeah i mean like think of an ai so what does it mean so and yeah it's a problem running a computer so what does it mean to have several copies so we can imagine like having some computer like a server standing somewhere and then like there's another server in another room and we think of it as two copies okay suppose that's two copies but now let's say that the two servers are 
standing in the same room next to each other is this still the corpus okay suppose it is now let's suppose that that instead this is just like a single computer but for purposes of redundancy every byte that's it's computed is computed twice to to like account for for random errors from custom crates or whatever does this still count as two copies you know at some point it's it's getting really not clear where the boundary is between different copies and just the same thing and it's really unclear how to define it in general case okay so getting back to information physicalism so we had these world models where like you had this uncertainty over the computational universe and the physical universe and you also have this bridge transform which lets you know like what computations are being run in a given universe and you can check if like your computation is being run in some universe next what i want to ask about is how you have loss functions because if i'm being an agent you know i need models and i also need like some kind of utility function or a loss function or something can you tell me what those will look like in the information physicalist setting yeah so this is another interesting question because in the cybernetic framework the loss function is just a function of your observations and actions and that in itself is something another thing about the cybernetic framework which is kind of problematic because in the real world we kind of expect agents to be able to care or you know assign some importance to things that they don't necessarily observe directly right like yeah so we can easily imagine agents caring about things that you don't directly observe like you know i care about some person suffering even though i don't see the person or a paperclip maximizer wants to make a lot of paperclips even if it doesn't see the paperclips all the time so one problem with the cybernetic framework is that you can only like assign rewards or losses or whatever it just things that you observe in the physical framework it's very different in physical framework your loss function is a function of gamma times 2 to the gamma in other words your loss function is a function of which computations are running roughly speaking and what are the outputs of these computations and this kind of can encode all sorts of things in the world right like if there's some computation in the world which is a person suffering then i can like care about the universe running this computation for example okay and basically that gets encoded in like yeah due to the gamma tells me which computations are running in the world and i can care about like having fewer computations like that or more computations like that yeah only there is a huge caveat and the huge caveat is what we call the monotonicity principle somebody since the principle is like the weirdest and most kind of controversial and unclear thing about the whole framework because the mutinicity principle basically says that your loss function has to be monotonic in which programs are running so physicalist agents the way we define them they can always be only be happier if more programs are running and more kind of upset if less programs are running they can never like have the opposite behavior and that's a kind of a weird constraint which we have all sorts of speculations about how to think about it but i don't think we have a really good understanding of how to think about yeah so it sounds like that means that like i can't like if there's some like program 
that's like a happy dude you know living a fun life and if there's some program that's like an unfortunate person who's wishes they don't exist so the monotonicity principle is saying like like i either prefer both of them happening or i disprefer both of them happening but i can't like like one happening but dislike the other happening yeah you can you cannot have a preference which says this program a specific it is like a bad program which i don't want to run like this is something that's not really allowed huh and why why do we have that principle in infirmationism why can't you just allow any loss function so the reason you have this is because the bridges form produces ultra distributions which are downwards closed meaning that so in in simple worlds what it means is that you can never be sure that some program is not running like you can be certain that some program is running but you always have like instead of being certain about the fact that some program is not running the the closest thing you can have is just not an uncertainty about whether it's writing or not it's like always possible that if it is running so given this situation it's not meaningful to try to prefer for a program not to run because you know it always might be running and because you resolve your knight uncertainty by you always like by checking the worst case the worst case is always gonna be like okay if you prefer this program not be wrong the worst case is always that it runs and then like there's nothing you can do about it and like the reason you cannot be certain that the program is not running well i mean the reason is that this is how the bridge transform works but in some sense it's a consequence of the fact that you can always have refinements of your state space like the state space phi which we discussed before you know we stated it kind of the space of all the ways the universe can be but describing word terms right we can kind of describe the ways the universe can be in terms of tennis balls in terms of people or in terms of like bacteria or something i don't know or in terms of atoms or in terms of quantum fields like you can have different levels of granularity and the kind of nuts thing is that we have a natural notion of refinement like given hypothesis we can consider various refinements which are like higher granularity descriptions of the universe which are consistent with the hypothesis that we have on this course level and the agent is kind of you should think of it as kind of always trying to find refinements of the hypothesis it has in order to to use this refinement to have less losses or like more utility and the thing is that this process is like not bounded by anything like the the engine always thinks that like it has 19 uncertainty as far as it knows the description of the university has it might be just a coarse grained description of something other more refined which is happening in the background and because you can always have more and more refinements like you cannot ever be any level of certainty that there is no longer any refinement you can always have more programs running because maybe the more refined description has additional programs running the more core screen description does not capture it's like you know maybe the description we have in terms of quantum fields maybe there are some sub quantum field thingies which we don't know about that encodes of lots of suffering humans and like we don't know that is how you think if you're an ugly physicalist okay so 
that does seem counterintuitive let's uh let's put that aside for the moment so you said that the loss function was just in terms of which computational universe you're actually in and what you know about the computational universe like uh which element of gamma and which element of t to the gamma so there's some intuition that it makes sense to want things about the physical state like in fact earlier you said like you might want to have a utility function that like just values like creating paper clips even if you don't know about the paper clips uh so yeah how do i phrase that in this setting yeah so this is an interesting question and you could try to to have lost functions that care directly by the physical state but that quickly runs into the same problems that you are trying to solve because then you end up again requiring to have some bridge rules as part of your hypothesis because you know the things you care about are not the quarks the things you care about are some complex microscopic thingies and then like you end up requiring your hypothesis to have some rituals that would explain how to produce this microscopic thing is and this kind of creates all the problems you were trying to avoid so the kind of radical answer that physicalist agents have or like at least kind of purist physicalist agents they say no we are just going to be computationalist we do not care directly about physical things we only care about computations so if you want to be a physicalist paperclip maximizer then what what you need to do in this case is have some model of physics in which it's possible to define the notion of paper clips and say the computations that i care about are the computation comprising this model of physics and dan it's like i want computations like if there's a computations that's simulating a physical universe it has a lot of paper clips in it that's the computation that you want to be running like that's what it means to be a favorite with maximizer if you're a physicalist okay cool and what would it look like to have a selfish loss function where i'm just like i want to be a certain way or i want to be happy or something yeah so if you're doing a selfish loss function then the computations you're looking at are your own source code right like we had this thing before which we used to define those contrafactuals which is like you know your own source code so you're checking where the universe is running your source codes with particular inputs it is also something that you can use to build your loss function you can say your last function is i want the universe to run the source code which is me with an input that says you know i'm i'm taking a nice path and eating some really tasty food or whatever okay all right cool and pro social loss functions would be something like me having a loss function that says i want there to be lots of programs that simulate like various people you know what they would do if they were in a nice bath and eating tasty food is that roughly right yeah you can think of like programs which represent other people or you can think about something like a program which which represents society as a whole like you can think of society also as a certain computation where like you have different people and you have like the interactions between those people and like all of this thing is some kind of a computation which is going on so you know i want the universe to be writing this kind of computation with some particular thingies which make me like this kind of 
Okay, cool. So another question I have is: one preference it seems like you might want to express is, if physics is this way then I want this type of thing, but if physics works some different way then I don't want that anymore. You could imagine thinking: look, if classical physics holds then I really care about there being certain types of particles in certain arrangements, but if quantum physics holds then I realize I don't actually care about those; I only cared about them because I thought they were made of particles. So maybe in the classical universe I cared about particles comprising the shapes of chairs, but in the quantum universe what I want is for there to be sofas and not chairs, because sofas fit the quantum universe better. Can I have preferences like that?

Yeah, so in some sense you cannot. If you're going with infra-Bayesian physicalism, you're committing to computationalism, and committing to computationalism means that there is no difference between a thing and a simulation of that thing. So this is a real commitment: if there's some type of physical thing that you really want to exist, and someone is just running a simulation of that physical thing, from your perspective that's equally valuable. And this is an interesting philosophical point which you can find objectionable, or, on the contrary, you can find it kind of liberating, in the sense that it absolves you from all the philosophical conundrums about what the difference even is between something running in a simulation and something not in a simulation. Maybe we are all a simulation running inside quantum strings in some sense, but does that mean we are not real? Does it really matter whether the universe is made of strings or quarks at the fundamental level, or something else? Should it change the extent to which I like trees and happy people and whatever the things I like are? So if you're a computationalist, you say: I don't care, I don't know what the basic substrate of the universe is, I just care about the computations.

Yeah, it seems weird, because it seems hard to even distinguish the cases. I might have thought that I could distinguish: here's one universe that actually runs on quantum mechanics, but it simulates a classical mechanical universe, and that's world A; in world B it actually runs on classical mechanics, but there's a classical computer that simulates the quantum world. And it seems like in infra-Bayesian physicalism my loss function can't really distinguish world A from world B. Is that right?

Yeah, that's about right. You care about which computations are running, or maybe what outputs they have, but you don't care about the physical implementation of those computations at all.

Okay, gotcha. And so to wrap that up together: we have these world models and we have these loss functions. Can you just reiterate how you make decisions given these world models and these loss functions?

Yeah, so the way you make decisions is by applying counterfactuals corresponding to different policies. You're considering: what if I follow this policy, what if I follow that policy? And you're supposed to somehow compute what your expected loss is going to be and choose the policy which has the minimal expected loss.
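As a rough formula, in notation chosen for this summary rather than taken from the paper: writing Θ for the agent's infra-Bayesian prior over pairs (y, α) ∈ Γ × 2^Γ, C_π for the counterfactual constraint "the computation which is me outputs actions consistent with policy π", and L for the loss, the decision rule just described is approximately

$$\pi^{*} \in \operatorname*{arg\,min}_{\pi}\; \max_{\mu \,\in\, \Theta \cap C_{\pi}} \; \mathbb{E}_{\mu}[L],$$

that is: impose the counterfactual for each candidate policy, take the worst-case expected loss over the credal set that remains, and pick the policy minimizing that worst case. The max reflects the adversarial, worst-case semantics of infra-Bayesian decision-making; the bridge transform and the refinements discussed next are suppressed in this sketch.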
So the way you actually do it is you construct counterfactuals corresponding to different policies, and the way you construct those counterfactuals, to a first approximation, is just the same kind of logical or computational conjunction we had before. They just say: suppose that the computation which is me produces outputs which are consistent with this policy; let's apply this counterfactual, the way we always do, to our hypothesis or prior, and let's evaluate the expected loss of that thing. But what you actually need to do is slightly more tricky, because if you do this naively, then you run into problems where you're never going to have good learning-theoretic guarantees. The problem is that if you do it this way, then you cannot rely on your memory being true: you wake up with certain memories, but you don't know whether those memories are things that actually happened, or whether someone is just simulating you already having these memories, and because you don't know whether these things actually happened, you cannot really update on them and you cannot really learn anything with certainty. But we can fix this by changing our definition of counterfactuals. Informally, the change means: I don't take responsibility for the output of my own computation if it's not continuous. If there is some continuous sequence of memories leading to a certain mental state, then I take responsibility for what I'm going to do in that mental state; but if something is discontinuous, if someone is just simulating me waking up with some weird memories, then I don't take responsibility for what that weird simulation of me is going to do. I don't use it in my loss calculation at all; I just consider it something external and not under my control.

Okay, and formally, what does that definition look like?

Well, more formally, what happens is that to each policy you associate some subset of 2^Γ × Γ, the subset of things which are consistent with that policy. In the naive version, that would just be: look at the computational universe, look at the source code which is me, and require that it outputs something consistent with the policy. In the more sophisticated version, we also look at the element of 2^Γ and see with which inputs the universe is actually running me, and then we only apply our constraints to those inputs which have a continuous history. That's one thing we need to do. The other thing we need to do is related. We haven't really discussed it before, but there's an important part where you're doing this kind of Turing reinforcement learning thing: your agent has some external computer that it can use to do computational experiments, run all sorts of programs and see what happens. There you also need a corresponding guarantee that your computer is actually working correctly, because if your computer has bugs and is just returning wrong things, then you cannot really update on seeing what it returns, which again creates problems with learning. In order not to have that, you need to apply a similar fix, where you only apply the constraint that your source code's outputs are consistent with the policy on branches of the history in which you have only seen the computer saying true things.
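A hedged formalization of this, in notation invented for this sketch rather than quoted from the paper: let G be the agent's source code, y ∈ Γ a computational universe, and α ⊆ Γ the set of programs-with-inputs the universe actually runs. The naive counterfactual for a policy π and its "continuous-history" refinement might be written

$$
C^{\text{naive}}_{\pi} = \{(y,\alpha) : G\text{'s outputs in } y \text{ are consistent with } \pi\},
\qquad
C_{\pi} = \{(y,\alpha) : \forall h \in \mathrm{Cont}(\alpha),\; G(h) \text{ in } y \text{ is consistent with } \pi\},
$$

where Cont(α) is the set of histories h such that the universe runs G on h and on every prefix of h (and, in the Turing-RL variant, only on branches where the external computer has returned correct answers). This is only a schematic rendering of the verbal description; the actual definitions in the paper go through the bridge transform and are more involved.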
Okay, so I can sort of see how you would operationalize "you have only seen the computer saying true things", but what does it look like to operationalize "there is a continuous history of your program existing"? Like, if I go to sleep or go under general anesthesia, is that going to invalidate this condition?

That depends on your underlying model. In our formulas, the inputs to our source code, to our agent, are just a sequence of actions and observations. So if you go to sleep after some sequence of actions and observations, and you wake up and the actions and observations continue from the same point, then the things in between just don't contribute to the sequence; it's continuous as far as you're concerned. So the continuity is not physical continuity, it's more like logical continuity. Continuity means that someone runs your code on a certain observation, then someone runs your code on that observation together with the action you actually output on it and the next observation, and so on. That second part is also important, by the way: another thing which is not allowed is someone running you on memories of you doing something which you wouldn't actually do; that's also something we exclude. So you have an observation, an action leading from that observation to another observation, and then a sequence with three observations, five, seven, and so on; those form a kind of continuous sequence. That's what continuity is: if there is some sequence on which someone is running you, but there is some prefix of that sequence on which the universe is not running you, then that's considered not continuous.
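A small, hypothetical illustration of "logical continuity" as just described, with all names and data shapes invented for the example: a history is a sequence of (observation, action) pairs, and it counts as continuous if the universe also runs the agent on every prefix of it, and every recorded action is what the agent's code would actually output at that point.

```python
from typing import Callable, List, Set, Tuple

Obs, Act = str, str
History = Tuple[Tuple[Obs, Act], ...]  # ((o1, a1), (o2, a2), ...)

def is_continuous(
    history: History,
    runs: Set[History],                 # histories the universe actually runs the agent on
    agent: Callable[[List[Obs]], Act],  # the agent's source code: past observations -> action
) -> bool:
    """Logical (not physical) continuity: every prefix must itself be run by the
    universe, and every recorded action must be what the agent would really output."""
    for i in range(1, len(history) + 1):
        prefix = history[:i]
        if prefix not in runs:
            return False  # agent "spawned" mid-stream with memories of a prefix that never ran
        observations = [o for o, _ in prefix]
        if agent(observations) != prefix[-1][1]:
            return False  # memories of the agent doing something it wouldn't actually do
    return True
```

In the refined counterfactual, the policy constraint is then applied only to histories for which something like `is_continuous` holds.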
Okay, and the point of this was so that you could prove loss bounds or something for agents. What loss bounds are you actually able to get in this setting?

Well, we haven't actually proved any loss bounds in this article, but I did show that you can prove at least some very simplistic loss bounds, along the lines of: assume there is some experiment you can do which can tell you which universe you are in, or at least which can distinguish between some classes of universes in which you might be, and further assume that this experiment in itself doesn't carry any loss, or almost any loss, so just committing to this experiment doesn't cost you anything. Then, in this situation, the loss the agent gets would be as if it already knew from the start which universe, or which class of universes, it exists in. A similar thing you can do on the computational side: assume you can run some computation on your computer and the act of running it doesn't itself cost you anything; then you can get a loss bound which corresponds to already knowing the result of this computation.

Okay, and that actually relates to another question that I wanted to ask, which is: if you imagine implementing an agent in the infra-Bayesian setting, how tractably can you actually compute the outputs of this agent? Or is this going to be something like AIXI, where it's theoretically nice but you can't actually ever compute it?

Well, that's definitely going to depend on your prior, which is the same as with classical reinforcement learning. With classical reinforcement learning, if your prior is the Solomonoff prior, you get AIXI, which is uncomputable; but if your prior is something else, then you get something computable. You can take your prior to be some kind of bounded-Solomonoff thing, and then the result is computable but still very, very expensive, or you can take your prior to be something really simple, like all MDPs with some number of states, and then the result is computable in time polynomial in the number of states, or whatever. The same kind of game is going on with infra-Bayesianism: depending on what prior you come up with, the computational complexity of computing the optimal policy, or approximating the optimal policy, is going to be different. We don't have a super detailed theory which tells us what happens in every case; even in classical reinforcement learning, without infra-Bayesianism, we have large gaps in our knowledge of exactly which priors are efficiently computable in this sense, not to mention in the infra-Bayesian case. I have some extremely preliminary results where you can do things like have infra-Bayesian versions of Markov decision processes, and under some assumptions you can have policies which seem to be about as hard to compute as in the classical case. And there is actually a paper, which is not by me, about a kind of zero-sum-game reinforcement learning, where you're playing a zero-sum game, and that can be shown to be equivalent to a certain infra-Bayesian setting; they prove some kind of regret bound with, I think, an efficient algorithm, I'm actually not a hundred percent sure, but I think they have a computationally efficient algorithm there. So at least in some cases you can have a computationally efficient algorithm for this thing, but the general question is very much open.
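To illustrate the zero-sum-game reading mentioned here, a short sketch under strong simplifying assumptions: a small finite MDP whose transition uncertainty is given as a finite list of candidate kernels, solved by robust (maximin) value iteration, with the adversary choosing the worst kernel at each step. This is a standard robust-MDP construction used as a stand-in, not code from the paper or from the cited work.

```python
import numpy as np

def robust_value_iteration(rewards, kernels, gamma=0.9, iters=500):
    """
    rewards: array [S, A] of immediate rewards.
    kernels: array [K, S, A, S] of K candidate transition kernels -- a finite stand-in
             for the credal set; "nature" adversarially picks a kernel, the agent picks actions.
    Returns a maximin state-value function: max over actions, min over kernels.
    """
    K, S, A, _ = kernels.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Q[k, s, a] = r(s, a) + gamma * E_{s' ~ kernel k}[ V(s') ]
        Q = rewards[None, :, :] + gamma * np.einsum("ksat,t->ksa", kernels, V)
        V = Q.min(axis=0).max(axis=1)  # adversary minimizes over kernels, agent maximizes over actions
    return V
```

With a loss instead of a reward, the min and max swap roles, matching the worst-case-expected-loss rule sketched above.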
All right. So, getting to questions about extensions and follow-ups: what follow-up work, to either infra-Bayesianism in general or infra-Bayesian physicalism in particular, are you most excited about?

Yeah, so there are a lot of interesting directions that I want to pursue with this work at some point. One direction which is really interesting is solving the interpretation of quantum mechanics. You could say, why do we even care about this, who cares about quantum mechanics?

Yeah, is it going to kill us all?

I think it's interesting in the sense that it's a very interesting test case. The fact that we're so confused on the philosophical level about quantum mechanics seems to be an indication of our insufficiently good understanding of metaphysics or epistemology or whatever, and it seems to be pretty related to these questions of naturalized induction which we have been talking about. So if infra-Bayesian physicalism is a good solution to naturalized induction, then we should expect it to produce a good resolution of all the confusion about quantum mechanics. And here I have some fascinating but very preliminary work which, I think, shows it can resolve the confusion. Specifically, I have a concrete mathematical way in which I can build an infra-Bayesian hypothesis that corresponds to quantum mechanics, and I think that I can prove, well, I have sketched it, I haven't written it out in enough detail to be completely confident except for some simple special cases, but I believe that I can prove that with this construction the bridge transform reproduces all the normal predictions of quantum mechanics. So if I'm right about this, then it means I actually know what quantum mechanics is; I have a complete physicalist account of quantum mechanics. All these questions about what actually exists, whether the wave function exists, whether the wave function is only a description of subjective knowledge, what's going to happen if you do a Schrödinger's cat experiment on yourself, all those questions are questions that I can answer now.

Okay, and what does that look like?

The way it looks is basically this. In quantum mechanics we have different observables, and usually you cannot measure different observables simultaneously unless the corresponding operators commute. What happens here is that you imagine that the universe has some Hilbert space and some wave function, some state, pure or mixed, on this Hilbert space, and then you have all the observables, and the universe is measuring all of them; the universe actually measures all of them. The probability distribution over outcomes of each observable separately is just given by the Born rule, but about the joint distribution, and here is the funny thing, we have complete Knightian uncertainty about what the correlations are. So we're imposing the Born rule for every observable separately, but we're not imposing anything at all about their correlations, and I'm claiming that this reproduces the usual predictions of quantum mechanics. One way to think about it is that when you do the bridge transform and then do infra-Bayesian decision theory on the result, you're looking at the worst case, and the worst case is when the minimal number of computations is running, because of the monotonicity principle. So in some sense what happens with quantum mechanics is that the decision-relevant joint distribution over all observables is the one which corresponds to running the minimal number of computations. It's as if the universe is really lazy and wants to run as few computations as possible while having the correct marginal distributions, and that's what results. The interesting thing is that there's no multiverse here: every observable gets a particular value, you just have some randomness, but it's normal randomness, not some weird multiverse thing; there's just one universe. You can get some weird things if you're doing a weird experiment on yourself where you're becoming a Schrödinger's cat, and in cases like that multiple copies of you can exist, but if you're not doing anything like that, there's just one branch, one copy of everything. And yeah, that's what it looks like.
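In symbols, a schematic version of the hypothesis just described (the notation is mine, not the construction from her draft): for a state ψ on a Hilbert space and a family of observables {A_i} with spectral projections P^{A_i}_a, take the credal set of all joint distributions over outcome assignments whose single-observable marginals obey the Born rule, with the correlations left completely (Knightian) unconstrained:

$$
\Theta_{\mathrm{QM}} \;=\; \Bigl\{\, \mu \in \Delta\Bigl(\textstyle\prod_i \mathrm{Spec}(A_i)\Bigr) \;:\; \mu\bigl(\{\omega : \omega_i = a\}\bigr) \;=\; \langle \psi \mid P^{A_i}_a \mid \psi \rangle \ \text{ for every } i, a \,\Bigr\}.
$$

The claim reported above, which she describes as preliminary, is that pushing this set through the bridge transform and taking the decision-relevant worst case (by the monotonicity principle, the joint that runs the fewest computations) reproduces the standard quantum predictions.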
Sure. And does that end up... So in quantum mechanics there's this theoretical result that says that as long as experiments have single outcomes, and as long as they're probabilistic in the way quantum mechanics says they are, you can't have what's called local realism, which is some underlying true fact of the matter about what's going on that also obeys the laws of locality, that doesn't internally do any faster-than-light or backwards-in-time signalling. There are various parts of that you can give up: you can say it's fine to not be local, or you can say experiments have multiple results, or that there's no underlying state, or something. It sounds like in this interpretation you're giving up locality. Is that right?

Yeah, it's definitely not local; in the sense in which "local" is defined for the purposes of that theorem, it's absolutely not local. But what's good about it is that, as far as I can tell, it's still completely Lorentz invariant. In this sense it's different from things like the de Broglie-Bohm interpretation, which also gives up on locality but then also loses Lorentz invariance, which is really bad.

And Lorentz invariance is basically: do you play well with special relativity?

Right, yeah. And here you don't have any problem like that.

Okay, that's interesting. Well, I look forward to reading more about that in the future. Are there any other follow-ups to this work that you're excited by?

Yeah, well, there are a lot of interesting things that I want to do with this work, like proving physicalist regret bounds of some kind in more detail. One thing I want to do is have some kind of Aumann agreement theorem for physicalists. With the usual Aumann agreement theorem, you have the problem that it assumes all agents have the same prior, but if your agents are Cartesian, then in some sense they don't have the same prior, because each agent has a prior defined from its own subjective point of view, and this can cause failures of agreement. In particular, Paul Christiano has this argument, "the Solomonoff prior is malign", which can be summarized as: if your AI is doing something like the Solomonoff prior, then it can reach the conclusion that the world is actually a simulation run by some other intelligent agents, and those other intelligent agents can cause your AI to do bad things, or to do things that they want it to do. And this scenario is a kind of failure of Aumann agreement, in the sense that the AI sees the same evidence as you, just from a different vantage point, but from the AI's vantage point it reaches the conclusion that it's in a simulation, even though from your vantage point you would not reach the conclusion that it's in a simulation. I conjecture that in infra-Bayesian physicalism you can prove some kind of Aumann-agreement-type theorem, saying that different agents inhabiting the same universe have a kind of common prior, because they don't privilege their subjective points of view anymore, and therefore these kinds of failures cannot happen.

Okay, that's interesting. So I guess another question is: apart from follow-ups, are there any complements to this line of research that you're excited by? Things other people are doing that work really nicely with infra-Bayesianism or infra-Bayesian physicalism?

Well, I'm not super aware of things that people are actively doing that obviously work very nicely with it, but just a couple of days ago I read this post by Abram Demski in which he talks about his thoughts on eliciting latent knowledge, which is this problem of: we have an AI and we're trying to make the AI tell us everything that it knows about the world, and tell us honestly.
A sub-problem he discusses there is this issue that if you have different agents, they have different subjective vantage points, so what does it even mean for them to talk about some shared truth? What does it mean for the AI to give us honest answers if the AI's truth is formulated in some completely different domain, which is its own subjective point of view? And Abram even mentions computationalism as a way to solve it. The nice thing here is that infra-Bayesian physicalism actually formalizes computationalism, and gives you this shared domain, namely which programs are running and what values they take, which is a shared domain of truth between different agents that you can use to resolve this kind of philosophical issue.

Okay, interesting. I guess I'd like to talk a little bit about you as a researcher. What does it look like for you to do research?

Well, my research is a very theoretical, mathematical type of research. There are several components to the process. One component is: how do we translate the problems, the philosophical or informal problems that we care about, into mathematics? And then there's: how do we solve the mathematical problems that result? Those two processes live in a kind of closed loop with each other, because if I come up with some mathematical model, and then I analyze it and it leads to conclusions which don't make sense in terms of the way I want to apply the model, then I know that I need to go back and revise my assumptions. On the other hand, sometimes I'm playing with the math and something just comes out of playing with the math which tells me, huh, maybe it's a good idea to make such-and-such an assumption, or to think about this informal concept in terms of this mathematical construct. So there are two coupled processes. And what's guiding my thoughts: there is a goal, and the goal is causing AI not to kill everyone, right? Then there are subgoals that we can derive from this goal, which are different problems that we want to understand better, different confusions we have that might have implications for our ability to think about AI safety, or just directly thinking about what kinds of AI designs can have what kinds of safety guarantees. For those informal sub-problems, I then try to come up with formal mathematical models, and my process is always to start with the most simplistic model that's not meaningless. If I have some informal problem and I'm trying to think of a model for it, I'm thinking: let's make the strongest possible simplifying assumptions I can imagine under which I can do something, anything, as long as it doesn't degenerate into something completely trivial. If I can make a bunch of simplifying assumptions and under those assumptions get a mathematical model in which I can prove something non-trivial, something which actually requires work to prove and doesn't just follow directly from the definitions the way I defined them, then it makes me feel like I'm making progress. And once I have that kind of foothold, I can go back and say: okay, so in this simplified toy world we've solved the problem.
Now let's go back and see how we can remove some of those simplifying assumptions, and then hopefully, step by step, you can climb towards a solution which starts looking realistic. Of course, it looks like many convergent lines of research: there are many problems, and eventually you need to solve all of them, and eventually they're all coupled and interacting with each other. With each of them you make a bunch of simplifying assumptions so that you manage to separate it from the other problems, so there's some divide and conquer going on; but then, as you start removing your simplifying assumptions, you also need to start merging those lines of research together, until hopefully somewhere at the end we'll get a grand big theory of AI and AI alignment and we'll be able to solve everything we need.

Nice, let's hope so. I'm about to wrap up, but before I do that I want to ask: is there anything that you think I should have asked but didn't?

I mean, one thing we haven't really talked about is the whole malign prior thing, and what the implications of physicalism are in this context.

Sure, so what is the malign prior thing?

So the malign prior is a problem articulated by Paul Christiano, and recently there was also this post by Mark Xu which gave some explanation and elaboration, summarizing previous posts about it. This problem is interesting because it seems really, really weird; it triggers the absurdity heuristic really hard, but the more you think about it, the more convincing it seems, at least to me. The gist of what is going on is the following: in the usual Cartesian approaches, we have those bridge rules, and those bridge rules produce very, very complex hypotheses, and this means that simulation hypotheses can sometimes start looking a lot more attractive than non-simulation hypotheses. Why? Imagine that there is some universe with relatively simple laws of physics, and it contains some intelligent agents, and those intelligent agents want to take over our universe. They know that in our universe we're building this AI which is going to be really powerful. So what they might do is run a simulation of our AI seeing things in our universe, and they run it in a way such that it's encoded in the simplest degrees of freedom that they manage to control: they run the simulation, and the output of the simulation is written into their analogue of electron spins or whatever, something that's fairly easy to describe in terms of basic physics, as opposed to things like pixels on a camera, which are very complex to describe. If they're doing this, then our AI can say: hmm, okay, what should I believe? Am I in this normal universe, with the normal Earth, but with these extremely complex bridge rules which look completely contrived and arbitrary and give a huge, huge penalty to my hypothesis? Or maybe the correct thing is this simulation hypothesis, with those much simpler bridge rules that come from reading the output of the simulation off those electron spins or whatever? That other thing now seems much, much more plausible, and what happens here is that the attackers...

Sorry, wait, why does the other thing seem more plausible?

Because the complexity is much lower: we don't have the complex bridge rules, we have some much simpler bridge rules, so the complexity can be lower, and you're penalizing complex hypotheses.

Yeah.
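The comparison being made can be written schematically (my notation): under a description-length prior, a hypothesis h gets weight roughly 2^(-K(h)), so if the "normal universe plus contrived bridge rules" hypothesis has description length K_phys + K_bridge, while the simulation hypothesis reads the simulators' output off simple degrees of freedom at cost K_sim + K_simple, the prior odds in favour of the simulation hypothesis are on the order of

$$
\frac{P(\text{simulation})}{P(\text{normal})} \;\approx\; 2^{\,(K_{\text{phys}} + K_{\text{bridge}}) - (K_{\text{sim}} + K_{\text{simple}})},
$$

which becomes enormous as soon as the contrived bridge rules cost even a few dozen more bits than the simulators' encoding, as the next part of the answer quantifies.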
And that can be super substantial: maybe you shave a hundred bits off your hypothesis complexity, so now you have a factor of 2^100 in the relative likelihood. So you can end up in a situation where your AI doesn't just consider the simulation hypothesis plausible, but is practically completely convinced that it's in a simulation; the simulation hypothesis is just overwhelmingly likely from its perspective. What's happening here is that the attackers are exploiting the fact that the AI is in a very, very special position in the universe. The AI is thinking: according to the normal hypothesis, it's just a completely random fact that I find myself as this agent, in this place in the universe, as opposed to a different agent, or as opposed to just a random clump of dust somewhere in outer space; it's a completely random fact that I need to hard-code into my hypothesis that I'm this agent. Whereas from the perspective of the attack hypothesis, there is a completely logical, mechanistic explanation of why I'm seeing what I'm seeing: I'm seeing this because this is what the attackers want to attack.

Okay, and why do they count as attackers? Why is this bad, if they're simulating the same things?

Well, the bad thing comes because, once the AI is convinced that it's in a simulation, they can alter its predictions in arbitrary ways. When the AI is asking itself, okay, what do I expect to see in the future, the simulators have complete control over that. The only limitation they face is that if they make a prediction which is not true in our universe, then once the AI sees that it's not true, the simulation hypothesis has been falsified; but by that point it might be too late. They can carefully engineer this future prediction so that it will cause the AI to do something irreversible, and from that point it no longer matters that the prediction will be falsified. Say, in their simulation, the AI should replace itself with a different AI that has different source code: if it does not do that, then something completely horrible happens in the simulation, and if it does do that, something absolutely amazing and wonderful happens. So the AI is just going to do it, and once the AI has done it, it's already too late: the fact that the future then reveals that the amazing, wonderful thing that was supposed to happen didn't actually happen doesn't save us anymore, because the AI has already replaced itself with some different AI which is just doing whatever the attackers wanted it to do.

Yeah, there are versions of this where, I think, the simulating AI says: okay, you have to follow this exact policy, you're going to do all these things, and if you ever deviate from that, incredibly terrible things are going to happen. And you could predict whatever you want, because if it's terrible enough, you can convince the AI you're attacking to just never deviate, so it never finds out. Seems like a different spin on the same thing.

Yeah, in reinforcement learning the problem indeed gets worse, because, like you said, counterfactual predictions can never be falsified. And if you're trying to avoid this by having your AI only do forecasting or something, like some kind of iterated distillation and amplification thing in which your AI is not actually doing reinforcement learning but only making predictions, then you're still going to lose, because of this other problem.
The simulators are just going to come up with whatever prediction will cause irreversible consequences in our universe that benefit them.

Okay. And basically the upshot is: this is why we should be worried about these bridge rule formalisms, and that's why we should prefer something like infra-Bayesian physicalism, I guess?

Yes, with the caveat that maybe physicalism can also arise by itself in some sense, and this is something we're still a little confused about. When we wrote the article, what we actually wrote there is that one question you can ask is: okay, maybe if your agent is some kind of AIXI or Cartesian agent, it can just discover physicalism as one of the hypotheses in its prior, and from there on it will behave as a physicalist. What happens if you try to do this very straightforwardly with AIXI is that you run into the problem that, in order to define physicalism, the agent needs to use its own source code, but AIXI has infinitely long source code, because it's uncomputable, so this hypothesis is going to have infinitely large description complexity. But if instead of just AIXI you're thinking of a kind of Turing reinforcement learning thing, then you can work around this, because the computer you're working with might be able to run short versions of your source code. For example, and this is just a toy example, if you're AIXI and your computer has a halting oracle, then your computer can implement AIXI as a short program, and you can use that to define a physicalist hypothesis. So maybe some types of Cartesian agents, if they're doing Turing reinforcement learning or whatever, while still not doing the full physicalism thing, can discover physicalism on their own, and that would ameliorate some of their problems. But you would still have some advantages if you just start off physicalist. And even if in practice it doesn't actually matter, even if you don't actually need to explicitly make your agent a physicalist, it still seems really important if you want to analyze and understand what's actually going to happen, because this shows you that the actual hypothesis your agent ends up with is something very different from what you would naively expect, so you really need to take it into account when you're doing any kind of analysis.

Okay. So that was the last question I needed to ask, so I guess we're done now. Just while we're wrapping up: if people listened to this interview and they want to learn more, they want to follow you and your work, how should they do that?

The easiest thing is just to follow me on the Alignment Forum; I post all of my work there, more or less. And if someone wants to ask me some concrete questions, or is thinking of some collaboration or something, then I'm always happy to discuss alignment things, so they're welcome to email me at vanessa.kosoy@intelligence.org.

Great. Well, thanks for appearing on the show, and to the listeners, I hope you'll join us again.

Thank you for having me.

This episode is edited by Jack Garrett. The opening and closing themes are also by Jack Garrett. The financial costs of making this episode are covered by a grant from the Long-Term Future Fund. To read a transcript of this episode, or
to learn how to support the podcast, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

Related conversations

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.


AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.


AXRP

11 Apr 2024

AI Control with Buck Shlegeris and Ryan Greenblatt

This conversation examines technical alignment through AI Control with Buck Shlegeris and Ryan Greenblatt, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.


Future of Life Institute Podcast

7 Jan 2026

How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann)

This conversation examines core safety through How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann), surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

