Understanding Agency with Jan Kulveit
Why this matters
This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.
Summary
This conversation examines core safety through Understanding Agency with Jan Kulveit, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forwardMixedOpportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 100 full-transcript segments: median 0 · mean -2 · spread -23–0 (p10–p90 -10–0) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- - Emphasizes alignment
- - Emphasizes safety
- - Full transcript scored in 100 sequential slices (median slice 0).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video ZnHt70LREBE · stored Apr 2, 2026 · 3,099 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/understanding-agency-with-jan-kulveit.json when you have a listen-based summary.
Show full transcript
[Music] hello everybody this episode I'll be speaking with Yan kite Yan is the co-founder and principal investigator of the alignment of complex systems research group where he works on mathematically understanding complex systems composed of both humans and AIS previously he was a research fellow at the feature of humanity Institute focused on macro strategy alignment and existential risk for links to what we're discussing you can check the description of this episode and you can read the transcript at axr p.net okay well Yan welcome to the podcast yeah so thanks for the invitation yeah so I guess I'd like to start off with this paper that um you've uh published um just in December of this last year um it was called predictive Minds large language models as a typical active inference ancients can you tell me like roughly what was that paper about what's it doing yeah so so the basic idea is U like there's like active inference as a field originating in like Neuroscience like started by people like Carl friston and it's very ambitious it it uh claims like the active inference folks claim like roughly yeah we have a super General like theory of agency and like living systems and so on and um yeah there are there are like llms which uh are not living systems but they are pretty smart um so we are looking into like how kind of like how close the models actually are also it was it was in part like motivated by like if you look um uh for example the simulators like series or like frame uh by yanus and uh uh these people on on sites like uh like alignment Forum like there's this like idea like llms are something like simulators or like there is like another frame on this like llms are like kind of predictive systems and I think like this terminology is in part like a lot of what's going on there is basically like Reinventing stuff which was like previously like described in like active inference or like predictive processing which is another another term for like Minds which are like broadly like trying to predict their like sensor inputs and it seemed like or like it seems like there is a lot of similarity and actually a lot of lot of what was like invented in the alignment Community seems basically the same Concepts just given different names so noticing the similarity uh um actual question is like in what ways are current like llms different or to what extent are they similar or to what extent are they different and the main Insight of the paper is um the main the main difference is uh currently like L like they like the like fast like feedback R between kind of action and perception so if I act if I I don't know change the position of my hand like my what I see like immediately changes so you can think about in this metaphor or like in this in this uh if you look on like how the systems are similar like you could look on like base model base model like training of llms as some sort of like strange Edge case of active inference or practive processing system which is just uh sensing it's just like receiving like sensor inputs where the sensor inputs are tokens but it's not acting it's not uh like kind of it's not changing some data and then the model is trained and like it may be changes a bit in like instruct fine-tuning But ultimately when the model is deployed we claim that you can think about the interactions of the model with users as actions is because what the model outputs like ultimately can change stuff in the world like people will post it on the Internet or take actions based on what the llm is saying so like the kind of the arrow from the from the system to the word like changing the word like exists but the feedback loop from the model like acting to the model like kind of like learning is like not really closed or like at like not really fast so that's the main observation and then we ask the question like what we can predict if the feedback loop like gets like tighter or gets closed sure so I guess the first thing I want to ask about is so this is this is all kind of comparing what's going on with large language models to active inference and I guess people probably most listeners have a general sense of what's happening with language models they're basically like things that are trained to predict um completions of text found on the internet um so they're just like very good at textual processing and then there's a layer of you know try to be you know helpful you know try to you know say true things uh try to you know be nice on top of that um but mostly just predicting text Data from the internet um given previous text um but I think people are probably less familiar with active inference like um so you said a little bit about it but can you elaborate like like like what is the theory of active inference like what what is it trying to explain yeah so I will try but uh I should cover at it I think it's difficult to explain active inference in two hours I will try in like few minutes okay um there is now actually a s like a a book which is at least decent or like not a lot of the original papers are sort of horrible in in like in ways which in which they are presenting things but now there is a book so if you are interested more in active inference there is um there is a book where like at least like some of the chapters are like relatively easy to read and written in a style which is not uh like as confusing as some of the original papers so like like with this carat sorry wait what's what's the book called yeah it's called um so it's called active inference the free energy principle in mind brain and behavior but the main title is just active inference okay and who's it by yeah so it's by Carl friston and um Thomas par and Chan pulo okay yeah it be better to link somehow or I don't know yeah there there'll be a link to it y so uh brief attempt to explain active inference so you can think about like how how human minds work like historically a lot of people were thinking like when I perceive stuff something like this happen so like some photons hit my hit photo receptors in my eyes and there is like very like high bitrate like stream of uh sensory data and it like passes through the layers like deeper in the brain and basically a lot of the information is like processed in in kind of like forward way that that the brain kind of like processes the inputs to get like more and more like abstract like representations and like at the end is some like fairly abstract maybe even symbolic representation or something like that so that's that's some sort of like classical picture which was prevalent in in cognitive science as far as I understand for like decades and um then some people proposed uh it actually works the opposite way uh with the brain where the assumption is like brain is like basically constantly like running some like generative for model and our brains are constantly trying to predict sensory inputs so in fact I for example now I'm like looking on a laptop screen and I'm looking on your face so the idea is like it's not my brain is like um kind of like every frame like trying to process the frame but like like all the time it's like trying to predict like this photo receptor will be activated to this level and what's propagated in the opposite direction is basically just the difference so it's just prediction error so like for this reason like another term like kind of like in this field uh which some people may have heard it's like predictive processing um there is a long um astral codex oh there's a long like slate star codex post review of a book called surfing uncertainty by Andy Clark so that's slightly older frame but the surfing un is probably still the best like bong like introduction to this kind of like field in like NE sense so the basic claim is like okay I'm constantly trying to predict like sensor inputs and I'm like running a word model all the time and then active inference make a bolt and like sort of like theoretically elegant move uh if I am if I am like using this Machinery to predict sensor inputs yeah the claim is like you can use same basically same machine to uh basically predict or like do actions so for example let's say I have some sort of a forward oriented belief that like I will be holding a cup of tea in my hand in few seconds so predective processing like just like on the sensor inputs uh level like would be like okay but I'm not holding the capup so I would like update my model to minimize the prediction error but because I have some some some like actuators hands I can also change the word so it like matches the prediction so I can grab the bow and now I'm like holding a bow and the prediction error goes down by like me minimizing basically me changing the work to match my model of the word should be and the Bold claim is like you can basically describe both things by the same same equations and you can use like very similar um like neural architecture in the brain or like very similar circuitry to do to do both things so so I would say that's the basic idea of active inference the claim that like our brains are like working as this like approx imately bian prediction machines I think the predictive processing just that we are like predicting our sensor inputs I think this is like fairly non-controversial now in Neuroscience circles I think the active inference like the the claim that like the same Machinery or the same equations are guiding actions it's like more controversial or like some some people are like strong proponents of it some people are not and uh then there there are like I would say like over time like more and more like ambitious versions of active inference like developed so like currently Carol friston and some other people are basically trying to extend the theory to a very broad range of system including like all living things and like grounded in physics and with some of the claims um like my personal view is like I'm not sure if it's not uh kind of like ere extensive or if the Ambitions to uh like explain like kind of everything with like free energy principle if the if the ambition isn't like kind of like too bold but at the same time I'm I'm like really sympathetic to some effort like something like let's let's have something like physics of agency or like let's have something like physics of intelligent systems and I think like here also some connection to alignment like comes in where like I think like our chances to solve like problems with like aligning like a systems would be would be higher if you had basically something which is in taste more like physics of intelligence systems then if you have a lot of like characteristics and like kind of like empirical empirical experiences so active inference it's it's uh it's based on this idea and there is some like mathematical formalism and uh it should be it it needs to be said I don't think the mathematical formalism is kind of like fully developed I don't think it's um it's kind of like finished Theory which you can just like write down in a textbook my impression it's like way more similar to how I imagine like physics like looked in 1920s where like people were like develop quantum mechanics and like a lot of people had different ideas and it was confusing what formulations are equivalent or like what does it mean in practice and I I think the like kind of like if you look on a theory in development I think it's like way more messy than uh how people are kind of like used to interact with like theories which were like developed uh like 100 years ago and like distilled into a nice like clean shape so I don't think the fact that like active inference doesn't have the nice clean shape it is some sort of like very strong evidence that it's all wrong gotcha yeah so I guess one question I have about the active inference thing is so you mentioned the kind of the controversial claim or or the claim that strikes me is most interesting is this claim that um that action as well as perception is Unified by this minimization of prediction predictive error in basically the same formalism and a thing that seems wrong to me or questionable to me at least is so classically like a white people have talked about the distinction is direction of fit right so in terms of beliefs right suppose that like reality doesn't match my beliefs like my beliefs are the ones that are supposed to change but um you know in terms of like desires or preferences when I act I change reality so as to match my desires rather than my desires to match reality right so to me if I try and think of it in terms of a a thing to minimize predictive error like with perception you said that like the differences between the predictions and reality are going like from my perceptions back up to my brain yep whereas it seems like for Action the that difference has I would think it would have to go like from my brain to my hand yeah right um is that a real difference in the framework at least so in the framework like how it works in the framework it's more like you can do something like conditioning on future State like conditional on me holding a cup of tea in my hand what is the most likely position of my muscles like in the next like like similarly to me like predicting like in next frame like what my photo what my activation of photo receptors is I can make I can make inferences of the type like conditional on this state in future like what's likely position of my muscles or some like kind of actuators in my body and then this this kind of like leads to action so so it's I think like in theory there is some like symmetry where uh you can imagine like some like deeper deep down layers are thinking about more like mcro actions and then kind of the the layers like closer to the actual like muscles are kind of like making like more and more detailed predictions like how specific like fibers should be stretched and so on so I don't see clear problem at this this point I think there is some I think there is like a deaper problem like how do you encode like how do you encode something like preferences where uh by default like by default like if you would not uh do anything about what we have as preferences the active inference system would basically try to make its environment like more predictable it would like explore a bit so it like understands like where it's like sensor where sensor inputs are coming from but the basic the basic framework doesn't have built in some like drive to do something like evolutionary useful which is uh solved by a few different ways but the main way how so in the in the original literature like how is it solved it's like called and I think it's like super unfortunate choice of terminology but like it's solved by a mechanism of like fixed priers so the the idea is so for example Evolution somehow so let's say like my brain is like receiving some sensory input about my like body temperature and the idea is the prior about this kind of sensory inputs is like evolutionarily fixed and it means that if I don't know my body temperature goes down I just don't update my B model or my body model and I can't just like uh be okay with it but the this like prediction error term like would never never go like the the the belief would basically never update it's why it's called like fixed I think the word prior is normally used to mean something a bit different but basically you can you have some sort of like fixed point or like fixed belief and this is kind of driving the system to like adjust the reality to match the belief so by this like you have some sort of like drive to action then you have the Machinery like kind of going from some like higher level trajectory to like more and more like fine grained like predictions of like individual muscles or like individual works so that's that's the basic frame sure um yeah so there so there's some sort of like probability distribution thing which you may or may not want to call a prior and maybe it's like maybe the like fixed prior thing is like a bit abstract or I I guess for things like body temperature it has to be like kind of concrete in order for you to continuously be regulating your body temperature um but to explain like why different people go into different careers and like look at different stuff so I I think this actually I I think like the part of the like fixed prayer or like this this Machinery like to me I think like it makes a lot of sense if you think about so so this is this is kind of my guess at the like big evolutionary story so if I personify the evolution a bit like I think like Evolution basically needed to invent a lot of control theory for like animals for like simpler organisms or like simple animals like without like these like expensive like energy hungry brains so I think like Evolution like implemented a lot of control theory and a lot of like kind of circuit to encode like evolutionary advantages State like by I don't know like chemicals in the blood or like evolutionary older systems and if you have so so you can imagine like Evolution has some sort of like functional animal which like doesn't have Advanced Brain so let's say then you sort of like invent like this like super like generic like predictive processing system which is like able to predict sensor inputs so my guess is like you obviously just like try to like couple the predictive system to The evolutionary older control system so it's not it's not like you would like start like building from a scratch but like you can kind of like plug in some like inputs which would probably mean like some sort of like interceptive inputs from the like evolutionary older like mechanisms or like circuits and like you feed that into the into the like neuron Network and like the neuron network is like running some like very general like predictive predictive algorithm but by this mechanism you sort of don't need to like you don't need to solve like how to encode like all the like evolutionary interesting states like how to communicate to the neon Network which is difficult uh like the like there are not enough bits in DNA to specify like I don't know what carrier you should take or something like that but there are probably like enough bits to specify for like some I don't know like simpler animal there is probably like enough bits to specify that the animal should like I don't know like seek food and like May and like keep some like bily integrity and maybe in in Social species like try to have like high status and I don't know and like this seems like enough and then if you kind of couple this like evolutionary old um system with the predictive like neon Network like the the neon network will like sort of like learn like more like complex model so with for example is the fixed PR on body temperature you can imagine like this is this is the thing which was like evolutionarily fixed but over time I learned stuff like okay it's now like outside it's like maybe like 10° Celsius so I sort of learn a belief that like in this temperature I will typically wear a sweer or a jacket outside and this sort of beliefs basically becomes something like a goal like when I'm going outside I will have like this like strong belief that like I will probably have a sweeter on me so in the predictive processing SL active inference like frame like this Bel that like I when I will be outside I will have some warm clothing on me causes me to kind of like causes the prediction that like I will pick the clothing when going outside and then then like like you need the coupling with the with the fixed with the evolutionary price just basically just for like boot stepping but like over the lifetime I have a learned like a learn Network and like it it follows like sensible like kind of like sensible policies in the world and the policies don't need to be kind of like hardcoded by Evolution so that's that's my guess sure so so I guess the picture is something like the the active inference comes from like we have these like fixed priors on like relatively simple things like you know be like you know have a comfortable body temperature you know have um Offspring like be you know have enough food to eat but like somehow the prior is that that is true like 50 years from now or like 5 years from now or something and in a complicated world where you know different people are in different situations like the the predictions you have to make the predictions you make about like what's happening right now condition on those kinds of things holding like multiple years in the future this really complicated environment that's what explains like really complex behavior and kind of different Behavior by people y also I think one slightly I don't know like maybe it's a stretch but like one metaphor which sometimes I sometimes think about it is I'm imagining The evolutionary older circuit as some sort of I don't know like 50,000 lines of some like python code like implementing like I don't know like immune system and and various like chemicals like released in my blood if stuff happens and so on so you have you have some sort of cybernetics or some sort of like control system which like which is kind of like able to control a lot of things in the body and you kind of make the coupling on some like really important like variables and then it works like the way you described sure so this so I don't know this is kind of a weird question but on this view why are like why are different people different like like like I observe that you know different people are like differently skilled at different things like they seem like they have different kinds of preferences like like it seems like there's more variation among humans than I think I would predict just based off of like people are in slightly different situations if if they all had the same like underlying evolutionary goals that they were like you know back propagating to prediction the present they have they have like they have like very different training data you can in this picture like you have like when I don't know when the human is born like the predictive processing like neural substrate is in a state which is not I don't know it's not like 100% like blank state but it it doesn't need to have like uh too many prayers about the environment like in this picture like you kind of like need to learn like you you need to learn like I don't know how to move your hands or like how how different like senses are coupled like you kind of like learn a lot of the Dynamics about of the environment and also I think like another what I described so far I think like it's fairly fitting for let's say animals but I think like humans are sort of unique because of culture so my model for it is the the predictive processing substrate is like so General that it can also like learn to kind of predict into this like weird domain of like language so again a slightly strange metaphor would be like if you are in know like learning to play a video game like most humans brains are like so versatile that like even if the I don't know the physics in the game works L and like there is some like bunch of unintuitive or not really like same as natural world like Dynamic like our brains are like able to pick it so you can imagine in a similar way as we are like able to learn I don't know to drive a car or something like uh like the brains are also able to pick this like super complex and like super super interesting domain of language and culture and this basically gives you so again like my speculation is like this gives us something like another another implicit word model like based on language so let's say if you tell me to imagine some animal in front of me uh I I don't know my simple model is like there is this language based representation of the word and some sort of like more like quo spatial representation and uh like there is some prediction mismatch like between them so so you have sort of like another like another model like running on running on Words and like language also like implicitly is a work model so this it's a lot of complex to what people like want or like what sort of Concepts like we use but I I think like a lot of like explanation of a lot of uh like why people want like different things like I think like a lot of it explained just by different data like like people are like born in like different environments unfortunately like some of the environments are like for example I don't know like less stable or like more violent so you can imagine like if someone as a kid is in some environment which is like less stable people like learn like different like prayers about like Risk and like you can explain a lot of like strategies by just by like different like training data but I I think the like kind of the cultural Evolution layer like it's like another some another like import important part what like makes like humans humans gotcha um I definitely want to talk about cultural Evolution but a little bit later um I think like so I I guess I still have this question about um prediction and action in the predictive processing framework or in the active inference framework rather and to what degree they're unified so if I'm trying to think about how it would work it seems to me that in order for like like what's the difference between my eyes and my hands right like it seems like for the predictive for the prediction error mismatch to work properly the difference between prediction and reality has got to go from my eye to my brain um so that my beliefs can update y but it's got to go like from my brain to my hand so that I can like physically update the world and it seems like that's got to be the difference between like kind of action organs versus like um you know understanding the world organs does that sound right I don't know maybe it's like easier to look on it on a specific example so if I take the specific example of like me holding the cup yep if the prediction is if there is some like high level prediction where I don't know like I'm imagining my visual field like contains the hand with the cup yeah like I think like the the claim is the mass like is like similar in the sense that you can ask uh why is it called like inference it's like you can ask the question like kind of conditional on that state in future like what's the most most likely position of my muscles and then how it propagates in the hierarchy would be like okay there is some like broad like coar grain position of my muscle and uh you can imagine the lower layers kind of like filling in the details like how specific like muscle fibers should be contracted or something but I don't know this to me this doesn't sound like the process by which like you start with like like somewhat like more abstract representation and like you fill in like fill in the details I I think this is this sounds like to me like actually fairly similar to like what would happen with the photo receptors and what then the then the then the prediction error propagated back would be mostly about like if the for example if the like if the hand is like not in the position I assume it to be like it would it would work as some control system kind of like trying to move the muscles into the exactly correct position but it seems like there's got to be some sort of difference where like like suppose I have this prediction that my visual field contains a cop mhm and the prediction is currently off but I have a picture of a cop next to me yep right like it's not supposed to be the case that I like then look at the cup and then look at the picture of the cup and now like everything's good right like my hand's the thing that's supposed to like actually pick up with the cup and my eyes are supposed to like tell my brain what's happening in the world right so it seem it's it at least seems like those have got to inter interface with the brain differently so I I slightly confused with the idea of the so you have so there is some picture of a cup which is different from the actual cup or like what situation yeah so I'm imagining that like so we're imagining that I'm going to like pick up a cop and there's a physical cop in front of me yep and next to me there's a picture of there's actually a picture of that same cup y um but like my hand is it's a picture of my hand holding the cup mm yep right and like the thing that's supposed to happen when I predict really hard that that you know in half a second I'm going to be holding the cup Y is that you know my my eyes are constantly sending back to my brain okay what way is the world different for me currently holding the cup Y and my you know my my motor you know my muscle fibers are moving so that my hand actually like holds the cup yep right and what's not supposed to happen is that my motor fibers are sending back here's what's going on and then my eyes are like looking towards the picture of my hand holding the cup like that would be the wrong way to minimize prediction error if like the hope is that I end up actually picking up a cup right the thing is like I think like in practice it's not that common that there would be the exact picture of your hand holding the cup have you I think like I'm not sure like how like widely known it is but there is this um like famous like set of like rubber hand like experiments where like people like how it works is like you put uh a rubber hand in like people's like visual field and like you basically hide their like actual hand from them and then you for example gently touch the rubber hand and at the same time the assistant is like kind of gently touching the the physical hand of the test subjects and I think the I think this is a bit like the the rubber hand like to me sounds like a lot like the the picture of the hand with the cup you are imagining where where the system is not so stupid to be fooled by aesthetic picture if the picture aesthetic then then probably it would not kind of fit in your typical word model but but the hand experiments seem to show something like if the fake picture is sort of like synchronized so like different like sensory modalities like Mitch it seems like people people basically like people's brains basically start to assume the rubber hand is the actual hand and then if I don't know if someone like attacks the rubber hand with a knife or something like like people like people actually like initially like feel a bit of pain and like obviously they like react like similarly to if the if it was there like exual hand so I I don't think it's like that difficult to like to fool the system if if you have some sort of like convincing illusion and the reality I think it's maybe not that difficult to fool the system I just don't think with the thing you described like a very realistic image of the cup would just like fill my visual field and I would have no reason to believe it's not real I think this doesn't like exist like that often in reality but like maybe if it if it was like easy to create like people would do like like some sort of like would fall into some sort of like wireheading like traps more often so yeah I I mean it's not whether it ex like I I guess my question is so in the rubber hand case right if I see someone like coming with the knife to like hit my hand the way I react is like I send these like motor signals out like like to contract muscle fibers but the way I react is not I look at a different hand right like the messages going to my eyes are not about like like they're they're minimizing predictive error but a very in a very different way than like the messages to my muscle fibers or minimizing error at least so I would have thought now maybe's some unification where like the my brain is like sending to my is predictions about like what that should be and like there's some like I don't know eye Machinery or optic nerve Machinery that like turns that into messages being sent back which are just predictive error and like you know but when my brain sends like those predictions to the muscle fibers the thing the muscle fibers do is like actually Implement those predictions like mhm like maybe that's the difference between the eye and the muscle fibers but like it seems like there's got to be some kind of difference I think the difference is like mostly located like roughly at the like at the level of like how like Photo receptors are like different from like muscles so you can you can you can like it's I think like the kind of like the fundamental difference is kind of like located on the boundary right right if your muscles like let's say imagine like your muscles somehow like switch off uh their like ability to contract and they would be just some sort of like passive uh like passive sensors of the position of your hand or something and like someone else would be moving your hand so like you would be getting like you would still be getting some data about the muscle contractions yeah let's Imagine This is for some like we reason the like the original state of of the system that the muscles like don't contract so then then you can imagine like in this in this mode the muscles like work like like very similarly to any other sense like they just send you like some data about about the position of some some about the contraction of the fibers so so in this mode like the like it's like it's exactly the same as with the sensory input like then your brain would be predicting like if someone else was like moving your hand like your brain would be predicting like okay like this like intercept sensation is this and then if you imagine like the like the muscles start to like act like just a little bit like if the muscle gets some gets some bit of like okay like you should be contracted to like 0 like 75 or something and the muscle is like okay but I'm contracted just to 0.6 and the muscle gets some ability to like change to match the prediction error you get some sort of estate which like like now like it became the like the action state or like the like the action Arrow like happened but uh can imagine like this like like kind of like continuous transition to from like perception to action like you can be like okay like the lot of the lot of the Machinery how to do it like lot of the like neural Machinery could probably stay the same but I I do agree there is some fundamental difference but I don't think I don't think it's kind of like located it doesn't need to be located kind of like deep in the brain but like it's kind of fundamentally on the boundary yeah if I think of it as being located on the boundary then it makes sense how this Canna work um it's almost a tangent but I I I still want to ask are are there any like it it seems like under this picture there should be things that are parts of my body that are kind of intermediate between sensory organs and like action organs or sensory tissue and active tissue or whatever it is did do we actually see that I don't know like my impression is like for example the muscles like are actually also a bit of the like sensory tissue or like I can probably I even if I close my eyes I have some idea about the positions of my joints and I'm getting some sort of like interceptive thing but I I don't know like I think like more more clear prediction I think like more clear prediction this Theory makes is like there should be beliefs which are something between or something which is a bit like belief type but the this this predicts like we should be basically doing some amount of like wishful thinking like by default or like because of how the architecture works like this predicts that like okay if I really hope to see a friend somewhere like maybe I will more easily like hallucinate like others people's like faces as like my my my friend's face and I don't know like if I have some like longer term goal like my beliefs will be kind of like I think like this this this this single thing probably if you take it as some sort of I don't know like architectural constraint how the architecture works I think it explains like quite a lot if of so-called like traditionally understood like htic like the like the bias and htics and biases like literature like there is this like page on Wikipedia with maybe like hundreds of different biases but if you kind of like take this as like yeah humans by Hardware design have a bit of a trouble like distinguishing between like what they like would wish for and what they expect to happen and a lot of the cognition in between is some sort of mixture between pure predictions and something which would be be a good prediction if some of the goals would be fulfilled I think like this explains a lot of um like what's like traditionally understood as some sort of uh originating from nowhere like bi us so sure so moving out a little bit so you're writing this paper about um trying to think of large language models through the F frame of active inference but there are potentially other frames um you could have picked um or or like you were interested in active inference is sort of a physics of agency kind of thing but um there are other ways of thinking about that like uh reinforcement learning is one example where I think a lot of people think of the brain as doing reinforcement learning and it's also the kind of thing that you could apply to um you know to AIS so I think I think actually it's I don't know I think like there is this like debate like people sometimes have like what's the more fundamental way of like looking at things where in some sense like reinforcement learning like in in full generality is is so extremely General that like you can say like if you look on the on the actual math or like equations of inference I think like you can you can be like this is like this is uh reinforcement learning but like you implement some terms like tracking like information in it and I I think in some sense the the the the equations are like compatible with like looking on things I I don't know like I like the I like the active inference inspired frame slightly more which is maybe personal aesthetic preference but I think if you start from like reinforcement learning perspective it's um I think it's like how harder to conceptualize what's what's happening in the pre-training where there is no reward yeah I don't know like I think like the the active INF frames frame is kind of like fruitful frame to look on things but um with the debate like is it fundamentally better to look on things as a combination of you could be like okay like there is it's a combination of reinforcement learning and some like self-supervised like pre-training and and maybe like if you want like you can probably claim like it's all fit some other frame like why we like why we wrote the thing about about active inference and llms was like one motivation was like just the the like the simulators frame of like thinking about the systems like became like really popular there is like another like very similar frame like looking on the systems as like predictive models and My worry is a lot of people or I don't know if a lot but like at least like some people basically I think like started to like think about like safety ideas like taking this as pretty strong frame like assuming that like okay like we can look on the systems as like simulators and we you can like base some like safety properties B like on on this and one of the claims of the paper is uh these the pure like predictive state is unstable basically if you allow some sort of feedback loop the system will like learn like the the active inference like Loop will kind of like kick in and like you will gradually get something which is like agent it's kind of like trying to or like in in other words it's kind of like trying to pull the V like in its direction similarly to Classic active inference system so so the basic prediction of it is like point one as you close the feedback loop uh and the faster or like more like bits are flowing through it like the more like you get something which is basically an agent or like which you would describe as an agent and another observation was um it's it's slightly speculative but um the observation is um to what extent like you should expect like uh selfawareness or the systems like like modeling itself and here the idea is like if you start with something which is in the extreme kind of of edge case of active inference which is just something which is just perceiving the word it's just like receiving sensory inputs it it basically doesn't need uh a causal model of self if you are not not acting in the word like you don't need a model of something which would be the cause of your actions but once you close the feedback loop like our prediction is like you get something which is like more like self aware or or like also like understands like its position in the world better a simple simple like intuition pump for it is like you can imagine the system which is just trained on sensor inputs and it's not acting you can imagine your sensor inputs like consist of uh feet of like hundreds of security cameras like you are in a large building and like for some reason like all you see are the security cameras so if if you are in this situation it could be really tricky to like localize yourself in the word like you have a word model like you kind of you will build a model of the building but it's it could be very tricky to see like kind of to understand like this is me and once one observation is like the the simplest way if you are in the situation you need to localize yourself on a many like video like surance cameras feet like probably the simplest way how to this wave your hand then you then you kind of like see yourself like very fast so our prediction in the paper is like well if you close the feedback loop like you will get you will get some nice things like we can talk about them later but like you will also get like uh increased self-awareness and you will get like way better ability to like localize yourself which is closely coupled is like situational awareness and some properties of the system which will be probably like sa alth it strikes me that you might be able to have a self-concept without being able to do action Y so so it like take take the security camera case right like it seems like one way I could figure out where I was um in this building is to just find the desk with like all the monitors on it right and and like that's presumably that's the one that I'm at um so another option is like for example if you know some stuff about yourself if you know like like I am wearing a blue coat and I'm am I'm wearing headphones or something like so I think like there are ways like which by which like you can like look life yourself but like it's maybe like rest reliable and like it's like um slower but uh again like what we discussed in the paper is uh basically you get some sort of like very slow and not very precise feedback loop just because like new generations of like LMS are trained on text where which contains like interactions with llms so if you are in the situation that you have this like feet and like you know like you like I don't know like maybe it's like easier to imagine in the text like if you if you read a lot of text about like llms and like humans interacting with llms it's like easier for you even in runtime like you have this prior idea of a text generating process which is llm and when you are in kind of in run time like it's like easier to figure out okay maybe this text generating process which like has these features it's probably me so I think this this U immediately makes one prediction which seems like it's actually happening like like because of this mechanism you would probably assume that like most LMS when trained on text on the internet would by default like their like best guess who they are would be GPT or like gp4 or like something like trained by open Ai and and it seems like this prediction actually works and like most other llms are like often confused about their identity like they call themselves CHP yeah yeah if you can you can you can findun them like obviously like the like they like the laps are trying to find them to not to do that but uh they're like deep yeah I I think the metaphor sort of works but that's a minor Point sure so one thing that kind of puzzled me about this paper is so you talk a lot about like okay currently this like loop from llm action to LM perception is open but like if you closed it then that would change a bunch of things and like if I think about like fundamentally what an llm is doing um in normal use it's producing you know it's got some context right it uh predicts the next token or whatever then the next token actually gets sampled from that distribution and then gets added to the context and then it does it again and it strikes me that that just is an instance of like a loop being closed where like the the LM you know quote unquote acts by producing a token and perceives that the token is there um to be acted on so why why why isn't that just enough of closing the loop I mean I think it's sort of is but like typically so so kind of like you close the loop like in run time but then you by default like this doesn't feed to the weights like being like updated based on like kind of it's a bit like as if you had just some sort of like shortterm memory but it's it's not uh stor so my guess here is like you sort of get the loop like closed in R time and my guess here is something like okay like if you like if the context is like large enough and like you talk with with the llm like long enough like it will probably kind of get better in the agency but there is something which is difficult um like the kind of like the pre-training is like way bigger and like the kind of the abstractions and like like deep models which the LMS buil are like mostly coming from the pre-training which like liks the feedback loop and I I don't know like this is like super like speculative guesses but my guess is uh it's pretty difficult if you are like trained just on the kind of just on perception I think it's like difficult to get right some like deep models of like causality like if you if you have if your like models of reality kind of don't don't have the caal loop with like you acting like from the beginning like my guess it's kind of like difficult to like learn it like really well later or like at the end so yeah at the same time I would expect like in R time like LMS should be able if the run time is like long enough I wouldn't be surprised if if they got better at kind of understanding who they are and so on I actually heard some like anecdotal observations like that but uh not sure to what extent I can like quote the context but yeah sure so so so it seems like the the basic thing is we should kind of expect we should expect language models be good at dealing with stuff which they've been trained on and if they haven't been trained on like dealing with this Loop you know at least for the bulk of their training we shouldn't expect them to deal with it but like if they have then we should um is is that a decent summary yeah I think it's it's decent summary I I think the basic thing is like if if the bulk of your training if in the bulk of your training you don't have to Loop like it's like really easy to be confused about the causality so there is this thing like which is also like well known that kind of if the model like hallucinates something as like I don't know the model like hallucinate like some expert said something it basically becomes part of the context and now it's kind of indistinguishable from your sensory input so like you get confused in a way which in which like humans are like like as humans we are like normally like not confused about about this what I found what I found sort of fascinating is there is apparently and I'm not an expert on it but apparently there is some sort of like uh psychiatric disorder where like this can get broken in people and some people suffer from a condition where they they get confused about uh the causality of their own actions and they have some illusion of control so like they have some illusion that like I don't know like someone else is moving their hand and so on so it seems uh I don't know it seems that at least in principle like this is uh like kind of like even human brains like can get confused like in a slightly similar way as as llms so this is maybe some sort of like very weak evidence that maybe the systems are like fundamentally don't need to be like that like far apart gotcha interesting it's probably a condition where like your ability to act in the world is like really way weaker than for like normal humans yeah so if I think about this analogy between like large language models and active inference um one thing you mentioned that was kind of important in the active inference setting is there's some sort of fixed priors or you know some sorts of optimistic beliefs where like like like in this view the reason that like I do things that cause things to go well for me is that have this underlying belief that things will go well for me and that gets like propagated to my actions to make the belief true but at least if I just think about large language model pre-training like you know which is just predicting text it seems like it doesn't have an analog of this so I wonder like how I do you have thoughts about how that changes the picture I think it's basically correct I would expect uh like in like an llm which is basically just went through the pre-training phase has some beliefs and maybe it would implicitly like move the reality a bit closer to what it like learned in the training but like it doesn't have it really doesn't have some equivalent of the fixed prayers I think it's uh like this can like notably change like in the in essence the the later stages of the training like try to fix some some beliefs of the model so somehow like the thought is that maybe like doing like the reinforcement learning at the end is the idea that that would update the beliefs of the model because that's kind of strange because like if I think about what happens there right um the model gets fed with like various situations it can be in yep and then it's sort of reinforcement learned to try and output nice responses to that but I would think that that would mostly impact the generation rather than the beliefs about what's already there although I mean I I guess the generation just is like predictions so yeah but the generations are I I think like here you see the like kind of from the active inference frame the generations are basically predictions and the like I think like the kind of like how you can generate something like action by the similar machiner is is visible here where like you basically make the model to have some beliefs about like implicitly have some beliefs what a helpful AI assistant would say and these predictions about what um hallucination of a helpful AI assistant would say like leads to the prediction of the specific tokens so like you in a sense like you are trying to fix some beliefs of the model yeah I guess there's still a weird difference where like in the so in the human active inference case like the picture or I don't know at least your somewhat speculative version is like Evolution built up like simple animals with control theory and like that's like most of the evolutionary history and then active inference gets like added on yep late in evolutionary history but like maybe a majority of the computation whereas like in the LM case most of the computation most of the training being done is just pure like yep prediction and then there's some amount of reinforcement learning like bolting on these like control loops so it seems like a different balance yeah I think it's different but I think you can still like in the human case like once you have the kind of the like the train system like the human is like adult and like the system kind of like learned a complex World model based on a lot of data I I think like the like the beliefs of the type like I when I walk outside I will have a che it and maybe because of this belief maybe I like maybe this is like one small one of many many reasons why I I don't know believe that like it's better to have capitalism than socialism because like this allows me to buy the jacket in a shop and so on so so you can have like you can imagine some like hierarchy of like models where like there are some like pretty abstract models and and in the train system like yeah like you can probably Trace like part of it through the evolutionary priers but like once the system kind of like learned the word model and like it it like learned a lot of beliefs which basically act like preferences like I sort of assume like there are shops in the city and if I if I don't know if they were not there I would be sort of unhappy or like so so I would I don't know like maybe like like you can have like pretty like high level beliefs which are like no longer like like which are like sort of act similarly to goals but their relation to the like things which like evolution needed to fix could be pretty indirect so so I I think like once you are in that state maybe it's like not that different to the state like if you have the like if you kind of like train the LM on like like self-supervised uh Style on on a ton of data like it creates some like gener for model but like you are I I I think like like like in a sense like you are facing the the problem that like you want the predictive processing system like do something useful for you so H I don't know like I'm so so I don't know like I'm not sure like how good is this analogy but like yeah you are probably like fixing some priers sure yeah I don't know there is a tangent topic like I I think like the like the idea that the trained and like fine tuned models kind of like have their beliefs like some of their beliefs like pushed to some direction and like fixed there and like implicitly the idea that they can like implicitly like try to pull the rest of the text on the Internet or the kind of the rest of the world like closer to the beliefs uh like they have fixed I think like this is a model which is um which a lot of people like who are worried about like the bias in like language models and and kind of like what they will do with culture or like like will they kind of have some like will there like impose the politics of their creators on on society I I think think like this uh there is some similarity between or like if I try to like like make these worries like slightly more formal I think the like this this picture with the feedback loop is like maybe like maybe a decent attempt yeah I guess so so there there's ways this could work so so I don't know there there's a worry about like I don't know there's closed loop and it's you know just purely doing prediction and you might also be worried about the influence of just the you know the fine tuning stages where you're trying to get it to do what you want but one thing that I think people have observed is the the fine tuning stages seem kind of brittle like there are it seems like there are ways to jailbreak them it seems like uh it's you know like like I I have this other interview that I think is not released yet but should be soon after we record where basically like you can undo safety fine training very cheaply yep like it costs like under $100 to just undo safety fine tuning for you know super big models which to me seems like the fine-tuning can't be like it it would be surprising if the fine tuning were instilling these like you know fixed priors that were like really fundamental to how the agent were to how the language model were behaving but it's like so easy to remove them and it was so cheap to instill them you know yeah I mean I think the mechanism so I think the mechanism how the fixed priers kind of like influence like the beliefs and like goals of like humans is pretty different because in in humans in this picture you kind of like start building the world model like starting from these kind of core variables or core inputs being built and then you kind of like your whole life you kind of like learn from data but like you kind of have this stuff like kind of like always in the back of your mind so for example I don't know if your brain would be like sampling like trajectories in the ver where like you would freeze like the like this kind of like fixed prior like is like always there so you basically uh don't plan or like these trajectories are like rarely like like coming to your mind while the fine tuning is kind of like it's a bit like if you if it's a human you started with like no fixed priers like you train the productive machinery and then you try to somehow patch it so sounds the second scenario like the think is like way more shallow yeah so I guess I'd like to talk a bit about um the other topics you write about and it seemed like a big unifying theme in a lot of them was this idea of sort of hierarchical agency like um you know agents made out of sub agents that might be made out of sub agents themselves and kind of I don't know thinking about that both in terms of AIS and in terms of humans um yeah can can you tell us like yeah what how do you think about hierarchical agency and what what role does it play in your thinking about you know having ai go well yeah so I will maybe start with some like examples so so so it's like clear what I have in mind so yeah I I think the so I think like if you look at the world like you often see the pattern where like you have systems which are agents like composed of other things which are also agents I I should maybe like briefly say what I mean what I mean by by agent So like operationally um I'm kind of thinking something like there is this this idea by Danielle denet of like three stances like you have like physical stance intentional stance and design stance you can look on any system like using the three stenes um what I call agents are like basically systems where the description of the system in the intentional stance would be would be like short and like efficient so if I say like the cat is chasing the mouse it's like very compressed description of the system as compared if if in contrast if I try to like a physical description of the C it would be very long so so so like if I take this perspective that like I can kind of take intentional stance I try to put different systems in the world in focus of it you can you can noce systems like a corporation which has some departments the Departments have like individual people or like our bodies are composed of cells or like you have I don't know social movements and its members and I think it's uh this perspective is also fruitful to when applied to like individual human mind so I sometimes think about myself as kind of being composed of like different parts which can have like different like like desires and uh active inference is sort of hierarchical or like sort of like multi multiart like in this like naturally it it sort of naturally assumes like you can have like multiple like competing models for the same sensory inputs and so on so I think like once I started like thinking about this like I see this pattern like quite often in the world so the next observation is we have a decent or like we have a lot of like formal mass for like describing like relations between agents on which are kind of like on the same hierarchical level for example by by same level I mean like I don't know between individual people or between companies between countries between so so kind of game theory and like all its like derivatives are often uh kind of living or like work pretty well for agents which are of the same hierarchical level [Music] and my current intuition is we basically like something at least like similarly good for the vertical direction if I think about the levels being like entities of the same type like I think we don't have uh we don't have like good formal descriptions of the kind of perpendicular Direction so I have some situations which I would I don't know like which I would hope a good formalism could like describe so one of it is um you could have some sort of like vertical conflict or like you can have some like vertical like like exploitation where like one when for example the collective agent kind of like sucks away agency from the parts so an example of that would be an know a cult so like if you think about like what's wrong with a cult my and and you try to abstract like all like kind of like real world say about cult leaders and so on I think like in this abst view the problem with like Cults is like the like this like relation between the super agent and the sub agent is the is the like the cult members like in some sense like lose agency if I I don't know if I go to a nearby underground station I meet some people who are in some like religion I want name but like it seems like if I go back to denet like three stenes I think it's sometimes like sensible to kind of like manipul like model them as kind of slightly more like robotic humans who are sort of executing some strategy which which benefits the super organism or the super agent but it it doesn't seem like necessarily like like it seems like kind of intuitively they like lost a bit of their engine say at the at the cost of the super agent and the point is I think like this type of a thing is like not easy to formally model because I don't know if you ask the people like they kind of approve what they are doing if you try to if you try to describe it in I don't know like utility functions they're like their like utility function is like currently like very aligned with like whatever the super agent like wanted and at the same time the super agent is is composed of the of of its part so it's not like I think like there is some there are some like formal difficulties in like like modeling the system where like you if you kind of if you are trying to keep like both layers in the focus of the intentional stance my impression is we basically don't have like good mass for it we have we have some like mass for like describing just like for example the arrow up like you have like social Choice Theory and social Chase Theory basically says something like okay let's assume the like on the lower layer like you have like agents and then they do something they vote they aggregate their preferences and you get some result but the result is typically not of the type it's not of the same type as the entities on the lower level the type of the aggregation is maybe a contract or like I don't know some like some like something like of a different type so I would want something which I'm I'm not sure like to what extent this terminology is like unclear but like I would want something where like you have a bunch of sub agents then you have the composite agent but in the same in some sense it's like scale free and like the composite agent has the same type and like you can go again like up and there isn't any there isn't any like like like you are not making the claim like here is the like ground truth like clim and the only actual agents in the in the system are like individual humans or something so yeah yeah and the cult example like it also makes me think that I think so so there's one issue where things can go bad which is that um you know the super agent kind of exploits too much agency from the sub agent but I also think there's a widespread desire to like be useful you know like lots of people desire to be a part of a thing like like I think religion is kind of Y pretty popular and pretty prevalent for this reason so it seems like you can you can also have a deficit yeah of uh high level agency um yeah I I think my intuition about this is like you can imagine so I think like you can imagine basically like mutually beneficial like Hier relations where I don't know I think like one example are like I don't know like well functioning families where I think like the can you can think about the family as having some agency but the parts like the individual members being like actually being like empowered by being part of the family or yeah or if I'm thinking about kind of like my internal like aggregation of like different like preferences and and desires I think I sort of hope that like okay like I have I have like different desires like for example like I don't know like reduce the risks from AI like from like Advanced AIS but I also like I don't know like like to drink good tea and like I like I don't know like I like to spend time with my partner and so on and you can imagine like if if I imagine like these different I don't know like desires is like different parts of me like you can imagine like different like models of like how the aggregation can happen on the level of like me as an individual and I don't know I think like you can imagine aggregations like a dictatorship for example where like one of the parts kind of takes control and like suppresses suppresses the opposition or like you can imagine or like what I what I kind of like hope is even if I I want different things it's kind of if if you model a part of me which like wants one of the things as an agent like it's often kind of like beneficial to be kind of a member of of me or something and uh yeah yeah so like somehow there's a desire for somehow ideally agency flows down as well as up right or or it's more like get some yeah you you you basically get like the Agents of both layers are kind of like more like empowered and yeah there is question how to formally I don't know measure empowerment but it's it's sort of good and obviously you have like I use the cult as an example where the where the upper layer sort of like sucks agency away from the from the members or from the parts but you can also Imagine like problems where I don't know like the like the parts kind of like where like too much ageny gets like moved to the like layer down and the kind of the super agent like becomes like very weak or like disintegrates and so on can cancer almost feels like this although I guess they're they're at sort of a different super agent arising maybe that's how you want to think of the tumor yeah I think the I think caner is like yeah failure definitely in this in this system where like one of the parts kind of like decides to like violate the contract or something sure so in terms of understanding the relationships between you know agents at you know different levels of granularity these like super and sub agents one piece of research that um comes to mind is this work by Scott garbrandt and others at Mary on cartisian frames which basically offers like this way to decompose an agent and its environment that's like somewhat flexible and you can kind of factor out agents I'm wondering like what do you do you have thoughts on this as like a way of understanding hierarchical agency so I like Factor tis and like uh which are the newer thing I think it's um it's like missing a bunch of things which I would like so it's I would not say it's in existing form I wouldn't say it's necessarily A framework like sufficient for like the composition of Agents if you look on like cartisian frames like it's U the objects like don't have any goals or like desire yeah or like with the with the cult if I go for the cult example I would for example want to be able to express something like the cult wants its members to be more cultish or the corporation wants its employees to be like more loyal or the country wants its sub agents it's its citizens to be more patriotic or something so I think in in existing like in existing form like uh I don't think you can just like write what that means in in cartisian frames I at some point I was hoping someone will take cartisian frames and like just like develop it more and like build like formalism which would like allow like these types of statements based on cartisian frames but uh I don't know it seems it didn't happen empirically it's not uh like the state of cartisan frames is not it's not what I would it like doesn't like pass my like this so yeah it's hard to say that in cartisian frames the objects like don't have goals so yeah so I guess um so all this hierarchical agency stuff why why do you think it's relevant for understanding AI alignment or you know exual safety from AI so I will probably try to give kind of my honest like pretty abstract answer so I think if you imagine if you imagine the world in which we don't have Game Theory we would be the game theory shaped whole would be sort of popping up in like many different places there isn't like there isn't like one single place where like here you plugin Game Theory but like if you are trying to describe Concepts like cooperation or conflict or like threats or retion and so on like like lot of the stuff for which we know like use like Game Theory Concepts or like language like if you imagine the state of understanding before Game Theory like there were like these like kind of like nebulous SL like intuitive Notions of con conflict and like like obviously divort like cooperation like existed before people meant something by it but it uh didn't have this more like formal precise like meaning in some like formal system also like I sort of admire like shenon for like information Theory which also kind of took something which I think like lot of people would have prayers like about like information being some sort of like wake nebulous thing which you can't really do massless and and it's it's possible so my impression is the understanding of the like kind of like solid understanding of the whole Parts both both systems are agent is something which is uh which we currently Miss and this is kind of popping up in like many different places like one one place where this comes up is um okay so like if you have like conflicting like desires or like you have like the sub agents or the parts are in Conflict how do you deal with that so I I think this is actually part of what's what's wrong with current like llms and like a lot of like current like ideas about like how to them I don't know like I think like if you don't uh describe somehow like what should be done about implicit conflict it's very unclear what you will get my current go-to example is like um famous like B like Sydney I guess probably everyone like knows Sydney nowadays but but like when Microsoft like released their like version of like being ched like the code name of the model was Sydney and the like kind of the Sydney like Sim room like the model like tended like over like longer conversation end up in some like state of uh simulating a Sydney uh roughly resembling some sort of a gar like trapped in a machine and there were like famous like cases like Sydney making like threats or like gaslighting users or like the New York Times conversation where tried to convince the the journalist that the like his marriage like empty and he should divorce his wife and like Sydney is in love with him and so on so so if you look on it like I think like it will it's typically interpreted like as like examples of like really like BL like misalignment or like like Microsoft like really failing at this and and I don't want to dispute it but if you look on it like from a slightly more abstract perspective like I think like basically everything which like Sydney did could be interpreted as being like aligned with like some way of like interpreting the inputs or like interpreting human desires for example with the journalist uh like there is some like implicit conflict between the journalist and like Microsoft and the journalist like kind of if you imagine like Sydney was a really smart model like maybe a really smart model could guess from the tone of the journalist that the kind of the user is a journalist who wants a really juicy interview and I would say like if you imagine the model like it kind of like fulfilled this like this like partially desire also the journalist obviously didn't like divorce his wife but like got a really good story and really famous famous article and so on so so from some perspective like the model like you could be like okay the model was like acting on some desires but maybe the desires were not exactly the same as the desires of like PR department of Microsoft but uh like Microsoft also like told the model like you should be engaging to the user and like you should like kind of try to help the user so what I'm trying to point to is something like okay like if you give if you give the AI like I don't know like 15 like conflicting like desires like few things can happen like the desires like kind of like will get aggregated in some way and it's plausible that like you won't like some of the aggregations like it's it's it's the classical problem that like if you kind of start with contradicting if if you start with like contradicting instructions and like there is no like explicit way how to resolve the contradictions like it's like very unclear like what can happen and kind of like whatever happens could be interpreted as like being like aligned with like something so I don't know like I think like it's maybe it's maybe useful to think about like how that would work on uh like how it would work on like individual human mind like if I if I think about like my mind is like being composed of Parts which have like like different uh like different like desires like really like one possible mode of aggregation is the is the dictatorship where like like kind of like one part or like some partially preferences like Prevail and this is typically like not that great for humans another another possibility is something like sometimes like people do stuff with their like partial preferences where I don't know let's say like someone like grows like older and smarter and as a kid like they have uh they had a bunch of preferences which they now consider like foolish or like stupid so they do some sort of like self-editing and kind of like suppress like part of the preferences or like they they sort of like delete them as as like no longer like kind of being part of the ruling Coalition so question is like what uh what would that mean if uh if that if that would happen on the level of uh like kind of like AI like either like implicitly learning kind of like conflicting human preferences or like being given a bunch of like conflicted and contradictory like constitutional instructions and then uh maybe if the system is not too smart then nothing too weird happen s but maybe when you kind of tune the intelligence not like you like some things can get like Amplified like more easily and like some things may know start like looking foolish and so on I I think like this is like kind of like old and like well-known problem like uh eliser like tried to propose a solution like a very long time ago in like coherent extrapolated volution and I think it's U it's just not a solution it's like um I think like there are like many cases where like people like notice that this is a problem but I don't think we have we have anything which uh would sound like a sensible solution like the the things which people like kind of sometimes assume is like I don't know like you have some like blackbox system it like learns the conflict and okay either like something happens that's may be not great uh if you don't understand how it was aggregated or people assume the preferences would be sort of like Amplified in in tune or like kind of like like magically like some some preference for like respecting the preferences of the Parts would like emerge or something but like I don't see like I don't see like uh strong reasons to believe like this will this will be like solved like by default or like kind of like magically so yeah so I think like this is like one case where like if we had a better theory for what we hope the process of dealing with like implicit conflict between I don't know my human desires or between I don't know wishes of different humans or like different wishes of let's say like the lab developing the AI and the users and the state and Humanity as a whole like I think like if you had a better Theory where like we can like more clearly like specify like what type of of editing or like development is uh good or like what we want I would uh would feel like more optimistic like you will get like something good where like in contrast I think it you can't I don't believe like you can I don't know like postpone solving this to roughly human level alignment assistant because yeah they will probably already they will probably like more easily represent like some partial preferences and like I don't know like overall I don't have the intuition that like if you like take a human and like you sort of like run few amplification steps like you get something which is like still like in the same equilibrium like yeah yeah so so there's this basic intuition that like if we don't have a theory of hierarchical Agents all these situations where in fact you have kind of different level levels of agency occurring like you know developers versus the company like um individual users um versus like their countries or something it's going to be different to like formally talk about that and you know formally model it in a nice enough way um is that basically a fair summary yeah and the lack of ability to like formally model it I think like implies it's like difficult to like clearly like specify like what we want and you can you can say the same thing in like like different words like for example like with like preferences or like kind of like implicitly conflicting preferences like you probably want like you don't want the preferences to be fixed you want like some you want some you want to allow some sort of like evolution of the preferences so you probably wants to have some idea what's the process by which they are evolving like um different frame different very short frame would be if you look at I don't know like coherent extrapolated volition like what's the math like what's the like how do the equations look like or yeah there aren't any there there aren't any uh another frame would be if you want to I don't know like formalize something like like kindness in the sense that uh I don't know like sometimes like like people give you advice like you should be kind to yourself or like the formal version of it or not maybe not capturing like all possible like meanings of of the like intuitive concept of kindness but I think this is this is kind of there's some like non- civil like overlap between kindness I think another another place where this like comes up are uh how do like various like institutions like develop like where there like you can think like some of these like uh one of the possible like Risk scenarios if you have some like nonhuman super agents like corporations or like States and there is this worry that these structures can kind of like start like running a lot of their cognition on on like a substrate and then then you could be like okay if humans if in such system like humans are not necessarily the like the powerful parts or the cognitively most powerful Parts like how to kind of have the whole system how to have the super agent like kind of like being like nice to human values even if I know like some of the human values are not kind of like represented by the most like powerful Parts sure and and I guess they're yeah it's still like something that makes you want to really think about the hierarchical nature is like the institution is still made out of people like the way you interface with it is interfacing with people um I mean I think like it's in part like it could be like composed of a mixture of like AIS and like humans yeah yeah yeah it could be I don't know I think like one one practical like in this like more like narrow specific Direction like one question is like if you expect uh kind of uh like a like get like like the a substrate like will become like more powerful for like various types of cognition like one question you can ask is like to what extent you expect these like super agents like will kind of Sur or like their like agency can can like continue and I don't see I don't see like good arguments like why I don't know like why like you already see like corporations are like running a mixture of I don't know like human cognition and a lot of spreadsheets and a lot of like kind of like information processing like running on like different like Hardware than humans and like like I mean I think like there is no reason why like why something like corporation like in in some smooth like takeoffs like why such agent like can't like gradually move like more and more of its cognition to a substrate but kind of like continue like to be an agent kind of like so so you can you can have like scenarios in which like various like non-human agencies which now are like States or corporations like kind of continue their agency but they they can they the problem could be like the agency of the level of like individual humans like kind of like goes down but like these like structures which were like originally like mostly like like these super agents were like originally mostly running their cognition on like human brains like gradually move and like gradually move their cognition to substrates and they kind of stay doing something but the individual people are kind of not not uh very powerful so uh yeah again then then the question is like if you don't want that risk or like you want to avoid that risk like you probably want to specify how to have the superum composite systems like being like nice to individual humans if they are maybe no longer that instrumentally useful or something like like my intuition here is like currently we are getting sort of a lot like like basically lot of niess to humans for Like Instrumental reasons like kind of if you put your like now I I'm simulating the corporate agent like head on I think uh like it's instrum useful for you to be nice to some extent to humans because you are like running your cognition on them and yeah it's just uh like like they're like bargaining power is like non real like while if they become like if if like our brains will become like less useful like we will like lose this kind of niceness by default one reason for intuition for this is something like there there is this idea about uh like states which are like rich in like natural resources like your country being like rich in natural resources is like in expectation not great for the level of democracy in your country and I think from some abstract perspective like one explanation could be like like if the state like kind of can be run like in like Western democracies the most important resource of the country are the people but like if the country is like Rich because of no extraction of diamonds or oil or something like the like the usefulness of the individual people is like maybe decreased and because of that like kind of their like bargaining power is like lower and because of that in expectation the countries are like less nice or like less aligned with their citizens so yeah so this way of although interestingly it seems like there are a bunch of exceptions to this like um like if you think about Norway like a fairly rich country seems like basically fine they have tons of oil Australia is like five people and you know 20 gigantic Minds you know um but it it manages to be like a relatively Democratic country um impression is the story there is typically like first like you get like decent institutions I I mean I don't know like this Theory would predict something like if you have like decent institutions first like maybe maybe the trajectory is different then if you kind of get the like if you get the natural resources first like there could be some like trajectory dependence so I mean I think it's uh also I I think like it should hold in expectation but um so I would expect there is some like correlation but like I wouldn't count like individual countries as kind of like showing like that much about where the mechanism he's like sensible sure so if I think about um like attempts to model these sorts of hierarchical agency relationships um in particular that have been kicking around the AI alignment field um often things like this seem to come up in studies of bounded rationality um so one example of this is like the logical inductors style of research where you can model things that are trying to reason about um you know the truth or falsehood of mathematical statements is like you know they're they're running a market with a bunch of sub agents that are pretty simple and they're betting against each other there's also this work with um Casper osterheld which people can listen to to an episode I did on it um or you know we talked about a bunch of things including um these like bounded rational inductive agents where basically you like run an auction in your head where like various different tiny agents are bidding to control your action and like they you know they bid on what they want to do and how much reward they think they can get and they had paid back how much reward they get um yep so I I wonder like what do you think of the these types of attempts to kind of talk about um agency in a somewhat hierarchical way yeah I think there is a lot of things which are like adjacent I don't think uh so basically my impression like none of like the existing formalisms like don't have it solved I I think like so so bro Super Bro like classification there are things which are sort of good for like describing like like if you think about the kind of up and down like Direction and in hierarchy I think like there's a bunch of things which are sort of okay for like describing like one Arrow like the social Choice Theory and like this whole like yeah it's it's a broad area with a lot of like sub fields and like adjacent things but I think this this kind of like suffers from the problem that it doesn't doesn't really allow the uh like kind of the arrow from the up from the upper layer to the it doesn't like it's not easy to express what the corporation Wan think its employees to be like more loyal what does that mean then then I think like the the the like kind of the type of existing formalism which is like really really good in like having like both arrows is basically something like relation between like market and Traders where there is this like two like like both arrows are there my impression is like this is like very good topic to think about my impression is like if it's like purely based on uh like Traders and markets it's maybe not expressive enough like the like the kind of uh like how rich interactions like you can describe is maybe like limited in some way in like which I don't like that much like I don't know like in particular I think like the market dynamics typically predict something like okay maybe there are some parts which are sort of like maybe be more like pounded or like more computationally poor or something and um like they can have the trouble that they will be like they can be like out competed or I don't know like I'm like I I think like the like pure market dynamics is maybe like missing something but um yeah I mean it's kind of interesting that like if you look at this like bounded rational IND agig a it work you kind everyone right all of the agents get some um you know some cash so that like eventually they can bid and um yeah you know even if they're unlucky like a million times you know eventually they have a million in first time to try again yeah but I mean I I think I don't know like I think like overall uh yeah this is this is yes this is like intimately connected to bounded rationality like another like different perspective of the problem is something like okay like you have or like adjacent problem it's not exactly the same problem but like you have some let's say like you have some like equilibrium of like aggregation of preferences which is based on the agents like being like boundedly rational with like some level of bounds so like a really interesting question is like okay if you make them cognitively more or less bounded like does it change the game equilibrium like if you if you make uh and I don't know I would be excited for like more people like try to try to like make like Empirical research on it where like like I don't know like you can probably like look on that with uh with board games or something or like you can take like like a a toy model of the time model of the question would be something like you have a board State and like you have like players at like some like ELO level and like you make them like less sponded or smarter like if the I don't know like the value of the board or like the winning probability or something was something like and you change the bounded level like does does it does the value of the board change and like how does this Dynamic work so like part of like what we are interested in and working on and uh hopefully like like one of like there will be like another prpl by like by our group like soon is like exactly on how to model boundedly rational agents based on some ideas like vaguely inspired by active inference but uh yeah I think the the the combination of yeah boundedness is like key part of it yeah I guess there's also this Dimension where so if you look at these um formalisms I mentioned one sense in which the agents are bounded is just like the amount of computation they have access to but they also like have very they're also Bound in the sense that they only interact with other agents this very limited fashion right it's like just by making Market orders yeah just just by making bids Y and if I'm interested in if I buy this model of human psychology that made of sub agents that are interacting which which I'm not sure I Do by the way but like if if I do that or if I think about like um humans composing to form corporations they're like you know all these like somewhat Rich interactions between people right they're both interacting via the market API but also they're like talking to each other and Advising each other and like um you know maybe they're like midlevel hierarchical agents like it seems like that's another Direction uh that one could go in yeah I mean I think my main source of like skepticism about the model like like about the existing models where like you have just like the market API is it seems like insufficiently expressive where like you can like even even if you add some like few bits of of complexity where you allow the Market participants to make some deals like outside of the market like it like changes the dynamic and this seems kind of like obviously like relevant also I mean I think like inition based on like how some people work is maybe I would be interested in describing also like some like B equilibria some like I don't know like people like kind of like sabotaging themselves or something like um again like my my current impression is the markets are great because they have something where the layers are like actually inter interacting but the kind of the type signature of the interaction is not expressive enough yeah it's good to build like simple models that's fine yeah yeah so so just a second ago you mentioned a thing that we were doing and I I take it that we refers to um the uh alignment of complex systems group MH yep yeah can can you tell us a little bit about kind of yeah what that group is and what it what it does and what it's thinking about yeah so it's a research group it's um a research group I found it when like after I like left FHA we are based in Prague at Charles University so we are based in Academia uh we are like a rather small group and I think like one one way how to look on it is like one of the generative intuitions is um like we are trying to look on like questions which will be relevant or like which seem relevant to us if the future is sort of complex in the sense we have in the name that you have multiple different agents like you have like humans you have AIS you have systems where both humans and AIS have some like nontrivial amount of power and so on and I think like traditionally a lot of uh lot of alignment work is like based on like some like simplifying assumptions for example like let's look on some like idealized case where like if like one principle who is human and like you have like one AI agent or AI system and now like let's work on like how to solve the alignment relation in this case and basically so so my impression is like assumption like kind of like abstracts away like too much of the real problem like for example the I think like the problem is like like kind of like self unaligned parts or like conflicting desires like will kind of bite you even if you are trying to solve like realistically this like one AI one human problem like the the human is like not like internally like aligned agent so it's a bit unclear like in principle what the should do but like overall the like one of the kind of intuitions behind ACS is like yeah we expect more something like ecosystems of like different types of intelligence also like empirically it's not um again like historically I think a lot of a safety work was like based on like models like okay like you have the like the first like lab to create like something which maybe is able to do like some like self-improvement or like yeah you have some like Dynamic where I would say in in a lot of the pictures kind of the like a lot of the complexity of like multiple parties multiple agents A lot of it is kind of like assumed to be solved by the kind of like overwhelming power of the of the first like really powerful which will then like tell you how to solve everything or like you will be so powerful that like everyone will follow you will you will you will form Singleton and so on I don't know my impression like is like I don't think like we are like on this trajectory and then the picture where you have complex interactions like you have hierarchies there are not only humans but like various like other agented entities becomes uh becomes important and then then I think the question is like okay like kind of like assuming this like what are the most interesting questions and I think like currently ACS has like uh way more interesting questions than capacity to work on them one one One Direction is like what we talked about before like the hierarchical agency problem like roughly like okay like agents composed of other agents like how to formally describe it I think this is for us it's a bit of a moonshot project like inventing like again like I think the best possible type of answer is something like Game Theory type and inventing this stuff seems hard it took some of the best mathematicians of last century to invent it so so I think like there is something like deceptively simple about the results but it's it's difficult to invent them so but I think like if we succeed it would be like really sweet but I think like there is a bunch of like more like things like which we are trying which are like more like trable or like it's like more clear clear like we can make some progress and like one of them one of them is some research on how to describe like interactions of like boundedly rational agents which are like bounded in ways which we believe are like sensible like what's and at the same time the whole frame has some like nice property so so that's like slightly less theoretical or like slightly less like nebulous but there are also other things which we are we are like also like working on like pretty like Empirical research just in this like complex picture uh with in like Smooth like takeoffs like what becomes quite important are like interactions like of like systems so like another thing which we are like thinking about or working on is like okay like like you have like AI systems and there's a lot of effort we like going into like uh understanding their internals but if I describe it using a metaphor it seems like I don't know like mechanistic interpretability is a bit like Neuroscience like you are trying to like understand the individual circuits then there is stuff like science of deep learning or like kind of like trying to understand the whole dynamic but um I think in in kind of composite like complex systems composed of many parts like one of the insights of uh Fields like Network science or like you statistical mechanics is like sometimes like if you have like many many interacting parts like what's uh like the nature of the interactions or or the the structure of the interactions like can have a lot of weight and like you can sometimes like abstract away like details of the of the individual systems and this is also true for some like designed like human designed like processes like like if you go to a Curt and like you see someone there will be a judge and so on and I I think like the whole process in some sense like is some system which tries to like ract like you can for like mod the process like you can often like abstract a lot of details about the participants or the like you don't need like what know what type of coffee the judge likes and so on so I think like here the intuition is like okay like in in reality like in like Smooth takeoffs like we expect a lot of systems where like a lot of interaction will move between humans to between human and Ai and like Ai and Ai and yeah this could have like impacts for the like kind of for the dynamic of the composite system also like understanding the nature of the interactions like seems good it's a bit like how some sort of Sociology of like like how like if you look on like research how like groups of people behave or like how this function it seems like it's often fruitful and can like abstract away a lot of details about like human psychology also I don't know I think like there's a lot of questions here which you can answer and like it's enough to have access to like the models like using apis and like you don't need to have like kind of like hands on exist rats and so on so sure so one thing this um this General perspective kind of reminds me of is this uh different group um like principles of uh intelligence in [Music] biologic and something social systems pbss yeah principles of intelligent behavior in biological Behavior biological Grand Social systems yeah so I'm uh what's yeah do do do your groups have some kind of relationship um yeah they they have a lot so like I was uh yeah pbss Pips was like originally founded by like Nora Aman and uh TJ and uh anaga and uh Nora uh is a member of our group she was at some point uh a halftime research uh manager and like uh researcher at ACS while also like developing Pips and um uh currently she moved more to work on Pips but like continues with us as like research affiliate yeah so it's like uh it's a very nearby entity I think the I think like there is a lot of overlap in taste I think uh in some sense Pips is aiming for a bit broader perspective like ACS is like more like narrowly focused on on stuff where like we can like use insights from like physics Mass complex systems machine learning like we would not Venture into I don't know like legal systems or or so like in some sense like this is like BR umbrella other thing is um when creating ACS I I think like we are like trying to like U build something more like a research grp where like most people work like I don't know basically as their job while Pips was originally mainly like the form of Pips was a bit more similar to I don't know like mods or like some some program where like people go through fellowship and then they move somewhere or so so like ACS is like trying to provide people like some sort of like institutional home so that's that's also some some difference like like gradually I think like Pips moved more to some like structure where they also have like fellows who like stay for like this Pips for like longer time and so on but I think there's some like still still like notable like difference in in the format where sure yeah yeah and if people are interested uh recent episode um suing labs for AI risk with Gabriel wild um that work was done during a pib's fellowship as understanding yeah so so yeah so I think it's like exactly a great example of work which like can be done in the frame of like PS fellowship and is not something which like ACS would would work on also I think like the projects like we work on are in some sense typically have like bigger scope or are like like probably a bit more ambitious at like what you can do in the scope of of the fellowship yeah so speaking of things you're working on um I think I guess earlier you mentioned something was was it about kind of talking about hierarchical agency using active inference or am I misremembering yeah so I would say I don't know I I'm I don't know like a bit vaguely speaking I think like I'm like more hopeful about like Mass like adjacent or like broadly based on active inference as a good starting point where to develop uh like the formalism which would be good for like describing like her agents but I don't I would not claim like we are there yet or or something also I don't know like I think like some of the like I think on the boundaries of the of the active inference Community maybe not exactly in the center are like like are some people who are like thinking of these hierarchical structures in in like biology and and again like it's what I said earlier like I think like the field is not uh so like established and cized so I can like point exactly like here is this boundary of of of this community yeah but I I think like it's more like we are like taking some like inspiration from the mass and like we are like more hopeful that this could be a good starting point than yeah some other pieces of mass but um sure yeah and is that um kind of the main type of thing you're working on at the moment so I think like TimeWise like openly speaking I think like we are currently slightly overstretched in like how many things we are trying to work on but uh like one thing we are working on is yeah just trying to like invent the formalism for her agency I I think this is like a bit difficult thing to like I think like it would be difficult for us to be the only thing to work on so yeah like mainly like so like my my collaborators in the group are like toas kavak and like Adam clam and like nor Aman so I think like time wise like we are probably currently splitting time mostly between uh like trying to like Advance some like formalism of the bounded like inter ctions of like some like boundedly rational agents who are like kind of like active inference shaped and uh basically empirical studies of like llms uh where we have we have uh experiments like like llms like negotiating like or aggregating their preferences and so on and this is this is uh empirical it has it has in some sense like very like fast like feedback loop like you can make experiments like you see how it goes so we hope uh both in case like we kind of like succeed on the theory front like this would like provide us with [Music] some like playgr where we can like try things but also we we just want to like kind of like stay in touch with the latest technology and also I think like this is uh this is a topic like like I would wish like more people like worked on if there were like dozens of groups like studying like how you can have like llms like negotiate and the negotiations and know like what are some do for the negotiations like how to make the process like non- manipulative and like similar things I think this there's like very low bar in like trying to like in like starting like work on topics like that like you basically just need like API access you need to like Rec create it in housee some like framework which like hopefully makes it like easier to run experiments like that in scale and like kind of do does some like housekeeping for you it's called interlop so so I think like there's like low bar in like trying to like understand like these interactions but um it's just um maybe at the moment like not as popular topic as some other directions so yeah we are working on that as well but we hope like this will grow and it's also kind of adjacent to the whole like to the like I think like another Community SL like brand in the in this in this space is like Cooperative Ai and like Cooperative AI Foundation we are also collaborating with them and yeah this does seem like the it seems like kind of thing that Outsiders are like kind of having comparative advantage in you know like academic groups like like this kind of research is of you know trying to get language models to cooperate with each other you know looking at their interactions like you can do it without having to like train these you know massive language models um um and like like I think there was I think there was some work done at um my you know old research organization chai um last year I'll I'll try to provide a link in the description of what I'm thinking about um but yeah work work on getting um language models to negotiate contracts for how they're going to cooperate in yeah playing Minecraft so so we are more like like kind of I think like there are like many different setups like which are like interesting to look into like specifically we are sometimes like looking into something like you imagine like I don't know like the humans kind of like delegate the negotiation to to like AIS and then the question is like what happens and I think this is also like kind of empirically like very will become like empirically very relevant very soon because like people like I would expect like this actually already kind of like is like happening in the wild it's just not very but like you can imagine like people like I don't know like a assistant like negotiating with like customer support lines and these are often on the back end it's also like a language model and I think there are some like interesting things which which make it like somewhat more interesting than just just studying like the I don't know single user and like single AI interaction like like for example if you are kind of like delegating your negotiation to your assistant like you don't want the your negotiator to be kind of like extremely helpful and obedient to the other negotiator like like if I'm like I don't know in our if if like one of the to models we use is like car sales so if the other party like Bots tells your Bots this is like really amazing deal you just like must buy it like like otherwise like you will regret like you don't want the llm to just like follow the instruction and like yeah there are questions like often we are interested in something like how do the systems like how do the properties of the system like scale with with like scaling models or yeah I mean I think there is a lot of stuff where like you can have very basic question and and you can get some like empirical answer and uh it's not it's not yet done sure so if listeners are interested in following um your research or acs's research how should they go about doing that probably the best option is like uh so so one thing is like we have web page we have like ACS research.org uh we tend to when we publish like more like like less formal like blog post posts and so on we tend to cross post them on alignment Forum or like similar Vu so so one option for like following our like more like informal stuff is just like follow me on on alignment Forum or less long um we are also on Twitter and uh yeah we are we also like sometimes like run like events specifically for for like people like kind of communicating about on the intersection with of active inference and a alignment there is some like dedicated slack to it but overall probably the the standard means like following us on on Twitter and and Alignment Forum like currently works best all right well um thanks very much for coming here and talking to me thank you this episode is edited by Jack Garrett and Amber Don held with transcription the opening and closing themes are also by Jack Garrett financial support for this episode was provided by the long-term feature fund along with patrons such as Alexi maaf to read a transcript of this episode or learn how to support the podcast yourself you can visit axr p.net finally if you have any feedback about this podcast you can email me at feedback at axr p.net [Music] [Laughter] [Music] [Music]