
AXRP · Civilisational risk and strategy

Shard Theory with Quintin Pope

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core AI safety questions through Shard Theory with Quintin Pope, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item.


Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Showing 140 of 156 segments for display; stats use the full pass.


Across 156 full-transcript segments: median 0 · mean -3 · spread -295 (p10–p90 -100) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.

Slice bands
156 slices · p10–p90 -100

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 156 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical


Episode transcript

YouTube captions (auto or uploaded) · video o-Qc_jiZTQQ · stored Apr 2, 2026 · 5,206 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/shard-theory-with-quintin-pope.json when you have a listen-based summary.

DANIEL: Hello, everybody. In this episode I'll be speaking with Quintin Pope. Quintin is a PhD student at Oregon State University, where he works with Xiaoli Fern on applying techniques from natural language processing to taxonomic and metagenomic data. He's also one of the leading researchers behind Shard Theory, a perspective on AI alignment that focuses on commonalities between AI training and human learning within a lifetime. For links to what we'll be discussing, you can check the description of this episode, and you can read the transcript at axrp.net. All right, Quintin, welcome to AXRP.

QUINTIN: I'm glad to be here.

DANIEL: So the first thing I want to ask: we're broadly going to be talking about Shard Theory and the Shard Theory perspective, and a research-methodological assumption behind Shard Theory research seems to be that focusing on the processes by which humans develop desires or values is important. I get the impression that Shard Theory people think it's really important to study this to understand AI, or AI alignment. I'm wondering if you could tell me a bit about why the Shard Theory methodology? Why look at how humans form these desires?

QUINTIN: Well, I think there are at least two perspectives under which this research direction is a good thing to pursue. One is that we want to align AIs to our values, and it just generally seems like a good idea to have some idea of what human values are: how they arise, computationally speaking, how they might change in ways that we do or don't desire, or don't reflectively endorse, and so on. So just from the general perspective of "we want the AI's behavior to match these things we call values, in some way that we're not really sure about, and we're not really sure what the values are", it seems good to have more clarity about what our target even is. That's the weak reason for wanting to investigate human value formation, and it's relevant to a broad range of perspectives on AI alignment.

In addition to thinking that understanding human value formation is good for that reason, I also have a stronger, more aggressive reason to think it's good to understand, which is that I think the best way to align AI is to have the AI, as much as possible, form values using the same underlying process that is responsible for human value formation. The reason I think this is that if you're uncertain about a thing, and uncertain about how to quantify it, but you want to reproduce the thing, and even reproduce aspects of it that you personally don't understand, it's best to try to replicate the causal process that caused the thing to emerge in the first place.

An illustrative example, to better visualize the point I'm trying to make here: suppose we had a classical painting, say, and we wanted to produce the best possible replication of that painting. One way to go about this is to try to reproduce the causal process that caused the painting to arise in the first place, which would be to have a person look at the painting and try to paint another painting that's very similar to it. Another way would be to use a totally different sort of causal process, such as taking a photograph of the painting and then having a printer print out that photograph.
As far as the things you can observe, or the things you can measure explicitly in order to incorporate into what's recorded in the photograph, that second process will definitely be better: it will better replicate the exact pixel distribution of the painting in question. So if you know everything about what you want to replicate in the thing you're looking at, and you're confident about all the unknown unknowns about the thing in question, then the picture approach would be better, because you can get a more precise replication. However, that's not our epistemic position with regard to alignment and human values. The thing that causes problems is the unknown unknown aspects of human values: some aspect of what we want, or of how our reflective process works, or other mysterious quantities about human values, that we didn't know we had to replicate, or didn't know we had to make the AI respect in some particular way. This is related to Eliezer's fragility of value, and boredom: if you just miss boredom, it's completely done.

So if you think about the painting example: maybe the human making the replica isn't going to produce as good a pixel-by-pixel replication of the image as a machine would have captured, but they are going to have partial matches along dimensions that the machine had zero match along. For example, maybe it turns out that the relevant aspect of the painting to replicate isn't so much its visual appearance as, say, the texture of the brushstrokes, or the smell of the paint, or the lighting conditions, or the angle, or whatever aspect we didn't think of at the time that actually turns out to be quite important. Trying to replicate the causal process by which an artifact appeared puts you in a better position to reproduce aspects of the artifact that you didn't know you should be targeting.

And I think this is a particularly useful thing to have in the deep learning regime, where we have loss functions and SGD: provided we can quantify something in a way that's expressible in a loss function, we can usually replicate that desired behavior, given sufficient scale, sufficient training, and sufficient resources. So it's the unknown unknowns that concern me, and I think replicating the causal processes which produce those unknown unknowns gives us the best chance of getting close enough along the dimensions we're not certain about.

DANIEL: One thing the analogy brings out: suppose you can take a photo of an artwork, or you can try to paint it yourself. One advantage of taking a photo is that I can get a camera that works really well a lot more easily than I can learn to paint really well, right? Somehow the camera has really fine-grained control over the details of all the pixels. So I'm wondering, and maybe this is just pushing the analogy too far, but by doing something more like an engineering, top-down approach, it seems like you can potentially have more fine-grained control over the aspects that you can measure. What do you think of that?
QUINTIN: Yeah, I think the analogy does break down at that level, because the thing it's analogous to is the deep learning, RL-esque approach to producing an intelligent agent, versus the basically nonexistent alternative at this point: some approximation to AIXI, some sort of explicit, pure, direct optimization search procedure; so, solving agent foundations and then trying to build an AGI. If you compare how quickly research has progressed in terms of getting deep learning systems to do stuff, versus getting anything else to do stuff, we're not even at the point where anything else can do stuff at all, much less the stuff we in particular want.

DANIEL: So in your mind, the process of human value formation is basically just deep learning? Enough that, in your analogy, you're saying "we need to study human values and recreate them, just like a painter", and then, "the painting is just doing deep learning"?

QUINTIN: At this point, I think I am comfortable saying that the workhorse underlying the human learning process is basically loss minimization slash reward signals. One interesting piece of evidence in this direction is the kind of startling degree of convergence we've seen between how state-of-the-art deep learning systems work and how the human brain works. In the human brain, you have basically self-supervised prediction of incoming sensory signals, predictive processing, that sort of thing, in terms of learning to predictively model what's going to happen in your local sensory environment. And in deep learning, we have all the self-supervised learning of pretraining in language models; that's also learning to predict a sensory environment. Of course, the sensory environment in question is text, at least at the moment, but we've seen how easily that same approach extends to multimodal text-plus-image, or text-plus-audio, or whatever other systems. Most of your cognition comes from that, whether you're a human or a deep learning model, and then there's a little bit of RL on top of that, which is directing that cognitive behemoth in useful directions using a relatively small amount of fine-tuning data.

And just very recently, there's actually been an advance out of Anthropic in terms of pretraining that further narrows the gap between what the human brain does and how, at least, they train their language models. Anthropic released a paper called something like "incorporating human preferences into pretraining", or the like, and what they did is, instead of doing all self-supervised pretraining at the start and then a bit of RL at the end, they basically mixed these things together and did what is essentially RL throughout the entire language model pretraining process. This brings it closer to what the human brain does, in terms of our reward circuitry continuously providing supervision over our cognition throughout our entire lifetimes, instead of only activating after your childhood, say, which would be absurd for a biological organism.

DANIEL: All right. I think I actually want to defer talking about how you think AI learning is going to look pretty similar to human learning within a lifetime. First up: we're talking a lot about understanding how humans form values. What do you mean by "value"? What aspects of humans are you aiming to understand and capture?

QUINTIN: So I think there are basically two ways in which humans have values.
One is that they have priors over their generative model, and the other is that they have discriminative preferences over plans. Humans have this thing where, say, you try to replicate the group-selectionist paradigm in beetles: you do an experiment where you punish groups of beetles for breeding too quickly and having too large a population, and you try to introduce group-level selection on their genomes towards restraining their own breeding. If you ask a human what this is going to result in, their generative prior over what sorts of adaptations would be appropriate is going to reflect certain types of value-laden judgments. They're going to say "the beetles will nobly restrain their own consumption of resources", or their own reproduction, or that sort of thing, and that's the first thing that occurs to them in terms of what's appropriate to do in this situation, even though those aren't the correct predictions in terms of evolutionary psychology or evolutionary biology.

And generally, people thinking about how to get someone's mother out of a burning building are going to be generating plans which are significantly biased towards the sorts of action sequences or outcomes that respect other types of human values. The first thing they're going to think about is the fire team rescuing the person's mother, and unless they're explicitly thinking about monkey's-paw-style wish-curling, they're probably not going to think "oh, the building explodes and her corpse comes flying out". So humans, when they're generating plans, have this tendency towards generating plausible, human-like plans. Well, that's tautological, but there are certain aspects of the sorts of plans humans tend to produce which are biased towards achieving various parts of human values, and not egregiously interfering with those values or damaging people. So that's an aspect of the generative prior which biases plans; that's one part of values, I think.

Another part of values is the discriminative preferences. There's this thing humans do when we're considering plans or outcomes, where we can say "this plan is better than this other plan", or "this outcome is preferable in some respect to this other outcome". When we have a list of different outcomes or plans or world states, we can order them according to something. This is what I'm calling discriminative values. And you can have these two things working together: a generative process which is already biased towards good plans, or quote-unquote "good" plans relative to our perspective, and then a discriminative process working over those plans, comparing different plans to each other and selecting the ones that are further along the goodness dimension, whatever that means. You can even do cleverer things, where you have some latent mental representation of a plan, and then you provide feedback from your internal discriminator to continuously shift that plan in the good-plan direction. And all of these, of course, have analogies in what we can do in deep learning.
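Editor's note: Quintin's generate-then-discriminate picture maps naturally onto best-of-n sampling against a learned scorer in deep learning. Below is a minimal, purely illustrative Python sketch of that loop; every function here is a hypothetical stand-in, not something from the episode.

```python
# Best-of-n sketch: a "generative prior" proposes human-plausible plans,
# and a "discriminative preference" scorer ranks them and picks the best.
import random

def generative_prior(goal: str, n: int) -> list[str]:
    """Stand-in for a learned generator whose samples are already biased
    towards plausible, value-respecting plans for the goal."""
    templates = [
        f"call the fire department to handle: {goal}",
        f"ask a nearby person for help with: {goal}",
        f"carefully attempt it yourself: {goal}",
    ]
    return [random.choice(templates) for _ in range(n)]

def discriminative_preference(plan: str) -> float:
    """Stand-in for a learned scorer that orders candidate plans by
    'goodness'; here, a toy heuristic preferring plans that get help."""
    return 1.0 if "help" in plan or "fire department" in plan else 0.0

def choose_plan(goal: str, n: int = 8) -> str:
    candidates = generative_prior(goal, n)                 # biased proposals
    return max(candidates, key=discriminative_preference)  # rank and select

print(choose_plan("get mother out of burning building"))
```

The fancier variant Quintin mentions, using discriminator feedback to iteratively shift a latent plan representation, corresponds to gradient-guided refinement rather than one-shot ranking; the sketch above only covers the generate-and-rank case.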
DANIEL: Sure, right. So one question I have is: when you say there's some aspect of imagining what things I might want to do that's related to my values, I think that's true. But part of how I imagine things I might do is also stuff I've done before, or things that would be easy for me to do, and it seems like a bunch of that is not particularly value-laden in the way that I think of values as being important for something like AI alignment. Do you think there's a structural difference, such that you could tell "this part of my generation of ideas is to do with values, and this part isn't"?

QUINTIN: I think part of this is socially determined, where we humans give each other feedback on the sorts of preferences that we'd prefer each other to have. So there are some parts of your planning process that, from a game-theoretic perspective, it's better to move to the forefront, or to say "this is more representative of the kind of agent that I am", as opposed to, say, your desire to minimize your own resource expenditure, and things like that. I think this sort of feedback from other people shapes our perspective of what's appropriate to have in our own minds, and so there are some parts of our planning process that we more strongly endorse than others. Also, partially as a consequence of social feedback, and partially as a consequence of what you get when you feed those sorts of learned preferences-over-preferences back in and have them analyze themselves, so to speak, you find that there are certain parts of how you plan that you want to retain more than other parts. You want to continue being a good person in the future, but you don't necessarily want to continue being someone who's better at programming in Python than JavaScript. If there's a pill that makes you a better JavaScript programmer than a Python programmer, by improving your JavaScript programming, that's a change in yourself you would be fine reflectively endorsing; and then there are other changes, of course, that we don't reflectively endorse. So that's one axis along which the things we call values tend to differ from other aspects of how we tend to think. I think it's a mix, or the two biggest factors that differentiate these things are: socially endorsed versus not, and internally endorsed versus not.

DANIEL: Okay. So hopefully that's a good way of keeping in mind what thing we're supposed to be explaining, whether with Shard Theory or something else. I'm jumping around a bit, but I'd like to go back to this idea of the way we're going to make AI. It seemed like you thought it would be good to figure out what made humans have human values, and replicate that process. There's this alternative perspective on alignment where the idea is: you have humans around, and you have some process which takes some fledgling AI, maybe in the process of training, and pulls it closer to what people want, such that whatever you wanted, you could pull the AI closer to it, so that it was trying to achieve what you wanted it to achieve.
It seems like this has the advantage that you're less likely to forget an aspect of what you want, because all your desires are still present in you, the human the AI is around. And it also seems a bit more engineering-y: for a bunch of artifacts in the world, often what happens is you imagine what you want out of the artifact, and then you design it to have that property. I'm wondering if you could say a little about why you're less optimistic about these "align the AI to some quasi-arbitrary goal someone has" plans?

QUINTIN: So I kind of don't think such a process even exists.

DANIEL: Oh. Why not?

QUINTIN: I think the entire idea of there being a goal, or such a thing as a goal... insofar as goals exist in the world, they're more or less exclusively artifacts produced by neural slash deep learning systems. GPT-4's goal is not to minimize predictive loss: that's the loss function with which it was trained, but insofar as it ever has goals, they're not that, unless you prompt it in some really weird way. So I, not very strongly, but somewhat, think that the sort of thing you're gesturing towards, being able to align something to an arbitrary goal, is not a thing that exists. And I also don't see a particular reason to think it should exist. People came up with the orthogonality thesis, say, by introspecting on themselves, and by thinking in frames that very much did not predict deep learning.

DANIEL: The orthogonality thesis being the idea that smart things can have basically arbitrary goals, right?

QUINTIN: Yeah. And I don't put much stock in either of those methods. After having studied deep learning quite a lot, it seems to me that deep learning operates on different intuitions than the rest of the world. It's sort of like quantum mechanics, where people try to come up with various analogies for how you should think about it, and describe it in terms of various classical things, but nothing in the classical world is really like quantum mechanics in the way quantum mechanics mathematically actually is. I'm guessing that deep learning, including the human neurological learning process, is in this category of things which are very, very difficult to reason about via analogy, especially if you go in trying to reason from analogies while not even knowing how deep learning works, and that it does work, and while not being familiar with empirical results from deep learning. So I'm extremely skeptical of the sorts of intuitions that I perceive as being behind the "there should exist a way of producing an intelligence directed along an arbitrary axis" idea.

DANIEL: On the orthogonality thesis, you mentioned that you believe the reason people thought of it is by introspection on their own minds. If human brains are made by processes that are really similar to deep learning and AI, shouldn't we think that's actually a pretty good methodology?

QUINTIN: Ironically, if you weren't aware of deep learning, or weren't aware of various empirical results from deep learning, you might think that. But just think about how bad GPT-4 is at introspection: if anything, it's way less introspectively aware than a human. And in general, I don't think deep learning systems are the sorts of things you'd expect to end up with good introspective awareness of how they themselves work.
Also, the issue with introspection is that it's operating at the wrong level of the abstraction hierarchy. The question of the orthogonality thesis, in the deep learning framework at least, is a question of what happens as a result of training different sorts of AI systems in various different ways; it's on that level of abstraction. But introspection, for humans, introspecting at a moment in time, is looking at the learned artifact of your neurological learning process, not the learning process itself. In deep learning speak, you're looking at the activations, or the activations are quote-unquote trying to look at themselves, but what really matters is the training process, the trajectory of SGD operating on the weights of the neural network, not the activation level.

DANIEL: So the idea is that we just shouldn't trust our own introspection as a way of imagining what it would be like if we were formed in a different way?

QUINTIN: You shouldn't trust it, in the sense that... it's not that introspection provides zero evidence; it's that you shouldn't necessarily believe the literal results of introspection. A neat example: if you work with ChatGPT, say, and ask it questions about itself, it's very often completely wrong about how it thinks, but its answers still provide evidence that's relevant for inferring how it thinks.

DANIEL: All right. I'd like to move on a little to talking about these background assumptions about human brains that, as far as I can tell, are going to underlie your understanding of human value formation. I've read some of your work, and first I'd like to say what I think the assumptions are as I understand them, and you can tell me if I'm wrong, or adding things that shouldn't be there, or missing important things.

QUINTIN: All right, go ahead.

DANIEL: This is based off a section of the post on Shard Theory, but I've modified it a bit to include things that I thought were important, which is why I want some feedback. The first assumption: there's this thing called a cortex in your brain, and it's basically responsible for making decisions about what to do. When you're born, you get a cortex that's kind of randomly scrambled: it doesn't have very important structure, and in particular, your genome isn't specifying the structure of your cortex very much. So human values and biases aren't really accessible to the genome. Is that a fair statement?

QUINTIN: So, for one, you have many cortices.

DANIEL: Oh, okay.

QUINTIN: And the genome does specify how they're structured at birth; it's just that that specification doesn't have very much information content in it.

DANIEL: So it tells you the initialization, in ML speak.

QUINTIN: Yeah. There are patterns and structures in there, but they're mostly not storing that much behavioral information. There is some direct behavioral information, in terms of neural circuitry specified exactly by the genome, such as keeping your heart beating, but for the most part, an understanding of language, say, isn't in there.

DANIEL: So when you said there were multiple cortices, which cortex are we talking about? All the cortices? Some important cortices?
QUINTIN: I think the Shard Theory post says something like "the cortex is randomly initialized at birth". I think Alex was using that phrase to mean cortical matter. There are different regions in the brain, and they are genetically specified to varying degrees. The prefrontal cortex is mostly not pre-loaded with genetically specified behaviors, but there are other things that are more strongly specified, such as the circuitry that keeps your heart beating, like I mentioned. My guess is that there are also hardwired pain-response behaviors that cause your hand to flinch away from a painful stimulus.

DANIEL: Okay. So my current understanding of the claim is something like: your brain has a bunch of parts; the genome specifies something about the architecture of your brain, but it does not set very many behaviors, at least none of the behaviors we think of as intelligent or interesting. The behaviors it sets are things like the fact that you can breathe without really thinking about it, or your heart beating, or reflexes. Is that roughly fair?

QUINTIN: Yeah, pretty much. I mean, there's also "eyes tend to track faces more than other stuff", but yeah.

DANIEL: All right. The next thing is that one of the things the brain does is self-supervised learning, which I take to mean something like learning to predict what it's going to see next, or what sensory data it's going to receive. Does that sound about right?

QUINTIN: Yeah. You have a bunch of senses over your environment, and also your internal state, and so on, and the brain tries to predict all of these things simultaneously, and also tries to predict them at multiple time scales. There are a bunch of generative slash predictive models that are continuously contrasting their internally generated predictions of the future with the actual incoming sensory information about the future. Some of these are oriented towards your internal bodily senses, some probably predicting whether you'll get sick soon, say, and others are oriented towards the external sensory environment, like what your visual system is about to perceive in the next second or so.

DANIEL: Okay. So your brain somehow has representations of all the various sensory data it's going to receive at various time scales, and there's some learning process making those representations accurate. Or maybe they're probability distributions instead of very simple representations. Does that sound right?

QUINTIN: It's closer to say that they have complicated representations. They're probably not factored in a way that would easily let you look at them and say "oh, this is a probability distribution".

DANIEL: Okay. In what sense are they complicated?

QUINTIN: They implement very complicated target mappings between what I previously observed and what I predict to observe in the near future. They're complicated in the sense that GPT-4 contains a bunch of extremely complicated learned circuitry: the representations track very intricate and not easily predicted aspects of the physical world around you, and the ways you interact with that world, and so on.
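Editor's note: the "predictions at multiple time scales, contrasted with incoming sensory data" picture can be made concrete in a few lines of PyTorch. This is a toy sketch on a synthetic one-dimensional "sensory" stream, with made-up shapes and horizons; it illustrates the training signal Quintin describes, not any model of the brain.

```python
# Multi-horizon self-supervised prediction: one small head per time scale,
# each trained to match what the "senses" actually deliver h steps later.
import torch
import torch.nn as nn

HORIZONS = [1, 5, 20]   # predict 1, 5, and 20 steps ahead (illustrative)
DIM = 16                # dimensionality of the fake sensory vector

class MultiScalePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(DIM, DIM) for _ in HORIZONS)

    def forward(self, sensory_now):
        # one internally generated prediction of the future per horizon
        return [head(sensory_now) for head in self.heads]

model = MultiScalePredictor()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
stream = torch.randn(1000, DIM)  # synthetic stand-in for a sensory stream

for t in range(1000 - max(HORIZONS)):
    preds = model(stream[t])
    # self-supervised loss: contrast predictions with actual future input
    loss = sum(nn.functional.mse_loss(p, stream[t + h])
               for p, h in zip(preds, HORIZONS))
    opt.zero_grad(); loss.backward(); opt.step()
```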
DANIEL: Okay, so the next assumption is that the brain also does reinforcement learning. I don't know which reinforcement learning algorithm, but some kind of learning based on reinforcement: you get some kind of reward, and that reinforces whatever you were doing right before you got the reward. In particular, the reward circuitry is some part of your brain that's genomically specified and very simple; it's something like "are you detecting a bunch of glucose molecules", or maybe your brain has some hardwired face detector and it's "do you see a smiling face", or something like that. Does that sound about right?

QUINTIN: Yeah. I'd say there are two aspects to it. One is a bunch of very simple, hardwired, genomically specified reward circuits over stuff like your sensory experiences, or simple correlates of good sensory experiences. There's also what you might call a learned value function, which is sort of like what behavioral science calls learned reinforcers. If your environment contains sensory correlates of something that produces reward, then the things that predicted the incoming reward eventually become reinforcers themselves. So when you're training a dog, initially you reward it with treats, but maybe you always play a sound prior to giving it treats, and eventually the sound becomes rewarding in and of itself.

DANIEL: And by "rewarding", you mean in the sense of a reinforcer that reinforces the computations that happened right before the event?

QUINTIN: Yeah.

DANIEL: Okay, cool.
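Editor's note: the "learned reinforcer" point is standard temporal-difference learning: a cue that reliably precedes primary reward acquires value itself. Here is a self-contained TD(0) toy in Python; the state names, learning rate, and discount are all illustrative, not from the episode.

```python
# TD(0) on the dog-training story: neutral -> sound -> treat, with primary
# reward only on the treat. Value propagates backwards, so the sound state
# ends up valuable, i.e. the sound becomes a reinforcer in its own right.
states = ["neutral", "sound", "treat"]
R = {"neutral": 0.0, "sound": 0.0, "treat": 1.0}   # primary reward
V = {s: 0.0 for s in states}                        # learned value function
alpha, gamma = 0.1, 0.9

for _ in range(500):
    for s, s_next in zip(states, states[1:]):
        # TD(0) update toward received reward plus discounted next value
        V[s] += alpha * (R[s_next] + gamma * V[s_next] - V[s])

print(V)  # V["sound"] ~ 1.0: the sound now reinforces whatever preceded it
```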
DANIEL: So I have a few more questions about that, but before that, I'll get to the other, somewhat more implicit assumptions. One thing, and I guess you've said it in a few places, but I understand the Shard Theory perspective to be that evolution and the genome sort of specify this setup, but they don't specify beliefs, or pieces of knowledge, or preferences; they just set up the learning process. Is that right?

QUINTIN: Yeah, pretty much. I suppose you can call reflexively pulling back from hot stoves a sort of preference, but I don't think the genomically specified behaviors are much more complicated than that.

DANIEL: All right. And then finally, here's something I read as an implicit assumption behind the Shard Theory conclusions: that these are all of the relevant processes for human within-lifetime learning. Once you understand the self-supervised learning and the reinforcement learning, you basically understand how humans learn within their lifetimes. Does that sound right to you?

QUINTIN: I think it's most of it, but there are of course other factors as well. For example, I think infants probably have an attention-level bias towards paying more attention to face-like structures than to other parts of their visual field, and this is not cleanly part of the reward system or part of the self-supervised objective. It's a sort of tendency to move your eyes in particular directions, which then feeds back into the sorts of things you tend to learn about more than other things. So it's a lever which influences the types of values that are statistically formed by this kind of learning process, but it doesn't neatly fit into the language of reinforcement learning or self-supervised learning directly. There are other factors as well, such as architectural facts about the brain, like which brain regions are closer to or further from each other, which influences the sorts of algorithms that are easier to represent in the kind of architecture the brain does have; these probably have downstream influences on your values in various statistical patterns. There's also the level of noise that your neurons have: research on deep learning inductive biases shows that the noise structure of your optimizer over time influences the types of generalization patterns that you tend to learn. So yeah, there are definitely other factors as well.

DANIEL: Okay. So would it be fair to say that you see those factors as important insofar as they influence how the reinforcement learning and the self-supervised learning work? Like, maybe the factors are biases in the value-neutral sense, biases in how the self-supervised learning goes, or hyperparameters of the RL, or something?

QUINTIN: For the most part, yeah. There are also various genomically specified, Rube Goldberg-esque thingies: your genome has various tools available to it to change your behavior mid-lifetime. So there was an experiment with salt deprivation in rats. What the experimenters did is they took these rats and sprayed extremely salty water into their mouths, and this is quite an unpleasant experience for the rats, so they ran away and didn't like the sprayer that sprayed them with salty water. Then the experimenters injected the rats with a particular compound which chemically imitates the physiological effects of extreme salt deprivation, so it caused the rats to biologically think that they were very salt-deprived. And the rats immediately went to the salt sprayer, were very interested in it, and tried to get salt water out of that sprayer, even though all the rewards they'd gotten from the salt water being sprayed into their faces were very negative at the time.

What I think happened, and what Steve Byrnes thinks happened when he wrote up a LessWrong post, which was something like "salt-deprived rats and their implications for inner alignment", is this: the genome specifies that there are going to be these predictors of various hardwired sensory signals, and one of those sensory signals is "salt water in my mouth". So throughout the rat's lifetime, it's learning to predict the occurrences of salt water in its mouth, and this acts like a learned pointer into the world model. The world model learns that there's this dispenser of salt water, and the genome now has a pointer to the part of the world model that says this dispenser thing has salt water in it. So when the genome activates a genetically hard-coded algorithm that says "we are now salt-deprived, we now need salt water", this connects with the part of the world model where the rat has learned about where salt is in its environment, and the genome can paint this thing with positive valence in order to direct the rat's behavior towards the salt water, now that it needs salt in its diet. And that's even though no part of the reward the rat received within its lifetime said salt was good, and no part of the rat's within-lifetime behavior told it that it should pursue salt water when it's salt-deprived.
DANIEL: Gotcha. So if I take this as being relevant to human brains and human cognition, it sounds like the type of reinforcement learning that's going on can't just be "you get a reward signal, and that strengthens or weakens certain behaviors or certain types of thinking". It seems like it must be the case that there's something being stored somewhere, like "these kinds of thinking got this type of reward", and then some process that adjusts your thinking either towards things that got that type of reward, or towards things that got the negative of that type of reward, if that's what's needed. Right? So if you're getting sprayed with salty water, some bit of you is storing "these behaviors were associated with saltiness; ordinarily I'm going to avoid saltiness, but sometimes I'm going to go towards saltiness, and then I'm going to pick those behaviors". Does that sound right to you?

QUINTIN: Yeah. So there are various, probably fairly simple, genetically specified algorithms and conditionals that activate to implement behaviors that are very hard to learn within a lifetime. You don't want your rats to have to go through periods of extreme salt deprivation and then explore around their environment until they find salt, in order to learn that salt deprivation needs salt to remedy it, because they'll die before they find the salt and discover this relationship. So the genome definitely does have this conditional circuitry for certain types of ancestrally relevant survival behavior.

DANIEL: Well, in this case it's kind of an abstract behavior, right? In the ancestral environment, there wasn't a guy in a lab coat you could go to to get salt.

QUINTIN: Yeah, but there were salt deprivations.

DANIEL: Yeah. And so the thing that must be happening is that what's being encoded is "get a thing that you stored as having salt", rather than a very direct behavior.

QUINTIN: Yeah. It's setting up the learning process, configuring the learning process, to learn pointers between what the genome can quote-unquote see and the learned understanding of the world. That lets the genome point the rat in various directions when it's relevant, for behaviors that you would die trying to learn in the real world, or that you don't have time to learn within your lifetime, and stuff like that.

DANIEL: Sure. I guess the reason I'm focusing on this is that, to me, this sounds different from what I think of as reinforcement learning in the very simple sense, where you have this reinforcer and it always reinforces behavior.

QUINTIN: I don't really think of it as that different. All behaviors are sampling from some distribution, and the thing about distributions is that you can condition them in various ways. Conceptually speaking, the normal sort of reinforcement learning we do, with just one type of reward signal, is sampling from some distribution of behavior, conditional on the samples having high reward assigned to them. What the brain is doing is conditioning on different sorts of things: instead of just conditioning on "this behavior gets high reward", whatever that means, it's "this behavior gets high reward as measured by the salt detector".
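Editor's note: the conditional-reward reading of the rat experiment fits in a few lines of Python. This is a toy sketch with invented names: a within-lifetime learned predictor ("where is salt?") gets painted with positive or negative valence by a hard-coded circuit that conditions on a physiological flag.

```python
# The same learned world-model pointer ("the sprayer predicts salt water")
# is valenced oppositely depending on a genomically visible need signal.
def salt_predictor(observation: str) -> float:
    """Learned within-lifetime: how strongly this observation predicts
    'salt water in my mouth' (the sprayer, in the experiment)."""
    return 1.0 if observation == "sprayer" else 0.0

def reward(observation: str, salt_deprived: bool) -> float:
    salt = salt_predictor(observation)
    # Hard-coded conditional circuit: aversive normally, appetitive when
    # the body reports salt deprivation.
    return salt if salt_deprived else -salt

print(reward("sprayer", salt_deprived=False))  # -1.0: rat avoids the sprayer
print(reward("sprayer", salt_deprived=True))   # +1.0: rat seeks the sprayer
```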
In the context of language modeling, we can do exactly the same thing, or a conceptually extremely similar thing, to what the genome is doing here, as in the "incorporating human preferences into pretraining" paper I mentioned a while ago. What they actually, technically do is label their pretraining corpus with special tokens, depending on whether or not the text depicts good or bad behavior. They have this token for "okay, this text is about to contain good behavior", so once the model sees this token, it's doing conditional generation of good behavior, and they have this other token that means bad behavior is coming. Or actually, I think they're reward values, or specially typed classifier values of the goodness or badness of the incoming behavior, but anyway: what happens is that you learn this conditional model of different types of behavior, and so in deployment, you can set the conditional variable to "good behavior", and the model then generates good behavior. And you could imagine an extended version of this sort of setup where, instead of having just binary good-or-bad behavior as your labeling, you have good or bad behavior, polite or impolite behavior, academic versus casual speech, factually correct claims versus fiction writing, and so on. This would give the codebase all these learned pointers to the model's quote-unquote within-lifetime learning, and so you would have these various control tokens, or control codes, that you could then switch between according to whatever simple program you want, in order to direct the model's learned behavior in various ways. I see this as very similar to the sorts of things that the brain's genomically specified behavior-steering control system is doing within a lifetime.
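Editor's note: here is a hedged sketch of that control-token scheme. The paper Quintin is describing appears to be "Pretraining Language Models with Human Preferences" (Korbak et al., 2023); the token names and threshold below are illustrative, and the real paper derives its labels from a reward model over text spans.

```python
# Conditional pretraining sketch: tag each pretraining document with a
# control token from a scorer, train on the tagged stream as usual, then
# condition on the "good" token at deployment.
GOOD, BAD = "<|good|>", "<|bad|>"

def toy_scorer(doc: str) -> float:
    """Stand-in for a learned reward model over text."""
    return -1.0 if "rude" in doc else 1.0

def label_corpus(docs: list[str], threshold: float = 0.0) -> list[str]:
    """Prepend a control token reflecting each document's score."""
    return [(GOOD if toy_scorer(d) >= threshold else BAD) + d for d in docs]

corpus = ["a polite, helpful reply", "a rude, dismissive reply"]
print(label_corpus(corpus))
# -> ['<|good|>a polite, helpful reply', '<|bad|>a rude, dismissive reply']

# A language model pretrained on the labeled stream learns a conditional
# model of behavior; at deployment you steer it by prefixing the control
# token you want, e.g. generate(model, GOOD + user_prompt).
```

Quintin's extension, many control codes (polite/impolite, academic/casual, factual/fiction), is the same trick with a larger control-token vocabulary.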
DANIEL: Okay. I wonder... part of this is the idea that the relevant process of learning is happening over a lifetime, the self-supervised learning and the reinforcement learning, and the genome is specifying mostly architectural facts, plus some very low-level facts like flinches or reflexes, stuff like that. If I look at humans, there's this literature on which things are quote-unquote genetically innate, or at least have some genetically innate component, versus totally specified by the environment. A standard finding is that you can take twins, separate them, and have them adopted in different places, and they're still very similar in a bunch of ways: they have a lot of similar preferences; maybe they go to church with similar frequencies, and stuff like this. At face value, this seems kind of contrary to this picture, so I'm wondering how you would reconcile those.

QUINTIN: So I want to be clear that I'm not arguing for a blank-slatist interpretation of human learning. We just had an entire discussion about how the genome has these various levers of control over your learned behavior, which let it steer your behavior in various ways depending on the circumstances you're in. That's one way in which your genome has options for determining what sort of person you are and how you act. The other thing the genome can do is specify your reward circuitry. To give a very simple example: different humans have different types of reward circuitry, or the reward circuitry of different humans judges tastes in different ways. If you have reward circuitry that tends to very highly upweight sugar consumption, then you're more likely to like sweets in your lifetime, and this is a genetically determined fact about your learning process which tends to influence the sorts of values you learn within your lifetime. So my general expectation, for the question of how much any value, or any behavioral trait really, is genetically influenced versus environmentally influenced, is that my initial assumption will be half and half, genetic versus environment. What I'm saying is that the mechanisms by which the genome exerts this influence are primarily not direct specification of the behaviors in question. It doesn't directly specify the neural circuitry that causes you to see ice cream and think "I want to eat that"; it specifies the parts of the reward function which, in combination with the environment you live in, cause you to form the sorts of behavior-specifying circuitry that incline you towards eating ice cream.

DANIEL: That kind of makes sense, but then I wonder... okay, I'm going to do something a bit naughty: I'm going to use religion as an example, even though I haven't actually checked that these results hold for religiosity, but it seems to me like the kind of thing they'd hold for.

QUINTIN: Yeah. I think that religiosity, or one's specific religion, is still slightly genetically influenced, but it's one of the least genetically and most culturally influenced factors.

DANIEL: I'm not thinking of which denomination you're part of; I'm thinking more of how frequently you pray, or exhibit generically religious behavior.

QUINTIN: Oh. That strikes me as more likely to be something that could be genetically specified.

DANIEL: And there I'm like: if that's happening... you can do some twin adoption study where people are raised in what at least seem to me to be relatively different environments, but it's sort of unclear how you can go from reward circuitry that specifies glucose or smiles or something, and get to religiosity, without the genome kind of whispering more things in your ear, so to speak.

QUINTIN: Yeah, that's one of the more difficult things to see how the genome has levers to influence. One thing the genome can do is this: the learned pointers to the world model can point to anything which has a genomically accessible correlate. What I mean is that the genome doesn't have to exactly specify which parts of the world model it should point to. If there are complicated parts of the world for which you can't genetically encode a precise classifier, such as whether you're in a religious gathering or not, the genome can just point to things that are, statistically speaking, correlates of that in your sensory experiences. If you can build any sort of simple classifier that has access to your sensory signals and is significantly more likely to activate for religious interactions, or religious-style thinking, than for other stuff, you can probably have a genomic-level bias towards or away from religion. One of the other things the genome can do is have simple classifiers over internal states of experience: strongly introspective states, say.
So maybe it has some level of access to the degree to which you're experiencing awe, say, as an emotion. The genomically specified reward circuitry of different people can assign different levels of reward, and different persistences of reward, to an internal experience of awe, and this could bias people towards being more or less religious, or towards cultural correlates of religiosity.

DANIEL: Sure, like not being religious but being spiritual, that sort of thing.

QUINTIN: Yeah.

DANIEL: So it sounds like you think emotions like awe are the kinds of things that might be genetically specified, such that they could be hooked up to a relatively simple reward circuit. Is that right?

QUINTIN: I think they're genomically accessible. I think you can have genomically specified circuitry that's looking at your brain and is able to estimate the level of awe you're experiencing, or the level of anger you're experiencing, and that feedback from genomically specified circuits can influence the level of awe, anger, and so on that you are experiencing. It's not an exact thing.

DANIEL: I guess it seems like it would be hard for this to be true if things like awe and anger were things that you learned during the self-supervised learning and then the RL, right? Because then you and I might learn them in different ways, and they could be in different bits of our brains, and it would be hard for my genome to say "this receptor is awe in Daniel, and it's also awe in Quintin".

QUINTIN: So it's not matching locations; it's consistencies at the level of behavioral patterns and how they relate to internal states, not pre-specifying this particular location in the learned world model as being the awe location. Say, to give a really simple example, people tend to be angry when they're hungry. Then the genome can look for correlates of hunger in your learned world model, and that's going to be anger plus a bunch of other things; and it can look for correlates of, I guess, pain, and that's going to be anger plus a bunch of other things; and the intersection of those things is more anger than other stuff. So when I say learned pointers, you can have various clever mechanisms for trying to extract where these various higher-level concepts live in the world model. And that's even assuming there isn't some whole-brain-level statistical pattern that anger tends to exhibit, or awe tends to exhibit. I think EEG studies on human brains are able to classify emotional states with better-than-random-chance accuracy, right? And if you think about the sorts of features they probably extract, it's not at the level of individual neuron activations; it's at the level of statistical patterns over blood flow in the brain, or, sorry, no, over high-level electrical activity patterns. So if, statistically, anger tends to look like this on an EEG, and awe tends to look like something else, I think you can have genome-specified circuitry that is more sensitive to the sorts of statistical patterns that anger tends to have, or more sensitive to the sorts of patterns that awe tends to have.
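Editor's note: the EEG claim, that coarse whole-brain statistics separate emotional states at better-than-chance rates, is easy to sketch. The data below is synthetic (invented band-power features and labels), standing in for real recordings; it shows the shape of such a classification test, not any particular study's pipeline.

```python
# Classify a toy "emotion" label from whole-recording statistical features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_features = 200, 32                 # trials x band-power features
anger_pattern = np.zeros(n_features)
anger_pattern[:4] = 0.5                        # weak distributed signature

y = rng.integers(0, 2, n_trials)               # 0 = awe, 1 = anger (toy)
X = rng.normal(size=(n_trials, n_features)) + np.outer(y, anger_pattern)

acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")  # above the 0.5 chance level
```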
DANIEL: Okay, yeah. So it makes sense to me that genomes could specify reward circuitry related to statistics of electrical activity. You also said something about reward circuitry hooking up to things in the learned world model. Are you thinking of that kind of thing, where for some parts of your world model, when you think about them, you get some emotion, and that increases some high-level statistic of electrical activity in your brain, and the circuitry picks up on that? Or are you thinking of something different?

QUINTIN: There are sort of two different things going on here, and I was kind of jumping between them in a way that was probably a bit confusing. There are the sensory ground-truth signals that the genomically specified circuitry uses to learn a pointer towards various concepts in the world model, and then there's that pointer itself. The high-level activation patterns the genome could specify a detector for: I was thinking of those as the sensory ground truth, which is then used to make a pointer into the world model, by seeing which aspects of the world model correlate with, or predict the occurrence of, those high-level electrical activity patterns associated with emotions.

DANIEL: Gotcha. So it's something like: the genome is plugged into electrical activity, or glucose, or some neurotransmitters, fairly concrete physical things about the brain; and the way that relates to the human world model is that some things in the world model, like "I believe a donut is on its way to me", are going to be statistically related to "I'm going to get glucose five minutes from now", or "right now I'm feeling anxious because I'm waiting for it". Is that roughly right?

QUINTIN: Yeah.

DANIEL: Okay, cool. So I think I understand this view a bit better. I guess I still have this question: why should I believe these assumptions about the brain?

QUINTIN: So, Steve Byrnes wrote a sequence of something like fifteen posts arguing for this picture of the brain as doing within-lifetime learning in the form of self-supervised learning and reinforcement learning; that's one thing. Also, if you compare it to the recent progress in deep learning, we seem to have stumbled upon a pretty similar paradigm to what the brain seems to do, which is suggestive: it's a convergent thing, an effective way of getting performance out of a computational system. There's also just how difficult it is to get interpretability to work at the level where you'd be able to hand-specify the sorts of circuits you'd need to specify in order for the genome to directly encode high-level behavioral parts of the world model and value system.

DANIEL: You mean interpretability on the human brain?

QUINTIN: Interpretability on deep networks. The tools we actually use to steer deep networks are, like I said with the Anthropic thing, conditioning of a generative model using basically learned pointers to the different concepts in the generative model. This is also how prompting works.

DANIEL: I mean, that counts as evidence once I believe that the human brain works this way, or that the human brain is really similar to deep learning, right?
QUINTIN: Yeah, you do have to believe that the human brain has a degree of similarity to deep learning for information about what is and isn't easy in deep learning to count. And then, do you basically want me to argue that the brain and deep learning are doing very similar things?

DANIEL: Well, if you don't think it's true, then I don't want you to argue it. But if part of your reason for thinking that this view of the brain is right is that it's similar to modern deep learning, then yeah, I would like to know why you think the human brain is similar to deep learning, without making reference to the similarities you're deriving from that connection. Or, if that's not part of your reasoning, I'm happy to drop it.

QUINTIN: Partially it's because we can just look at the activations in the brain and the activations in deep models, and it turns out that these things are startlingly similar to each other. In particular, if you look at the activations in the brain's linguistic cortex while it processes various types of text, and then compare that to the activations within a language model Transformer processing the same types of text, you can find a linear transformation from the Transformer's representations to the recorded brain representations. This linear transformation is much better at predicting the brain's representations than a transformation from a randomly initialized Transformer, and in fact, there have been studies finding that the linear transformation on top of the model embeddings predicts the recorded brain activations up to the noise limit of the brain recording technology they were using. Different studies found different degrees of correspondence between brain activations and neural network activations. I think there was a study out of either Facebook or Microsoft, I think Facebook, which either used more sensitive recording instruments or something like that, and what they found was that language model activations are predictive of brain activations, and also that as you make the language model's training objective more similar to the presumable training objective of the brain's linguistic systems, such as by having the language model predict multiple tokens in advance instead of one token in advance, the degree of similarity between the language model activations and the brain recordings goes up. So that's one piece of evidence: a direct comparison between brain activations and language model activations.
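Editor's note: the "linear transformation" test Quintin cites is a standard encoding-model analysis: regress recorded brain activity onto model activations and compare held-out predictive accuracy against a random-initialization baseline. Below is a sketch with synthetic arrays standing in for real recordings and real language-model features.

```python
# Fit a ridge regression from (stand-in) LM activations to (stand-in)
# brain recordings, and compare against random features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stimuli, d_model, d_brain = 500, 64, 20

lm_feats = rng.normal(size=(n_stimuli, d_model))      # trained-LM activations
true_map = rng.normal(size=(d_model, d_brain))        # synthetic ground truth
brain = lm_feats @ true_map + 0.5 * rng.normal(size=(n_stimuli, d_brain))
random_feats = rng.normal(size=(n_stimuli, d_model))  # untrained-LM stand-in

def encoding_score(features: np.ndarray) -> float:
    Xtr, Xte, ytr, yte = train_test_split(features, brain, random_state=0)
    return Ridge(alpha=1.0).fit(Xtr, ytr).score(Xte, yte)  # held-out R^2

print("trained-LM features:", round(encoding_score(lm_feats), 2))     # high
print("random baseline:   ", round(encoding_score(random_feats), 2))  # ~ 0
```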
Okay, yeah. So that's one piece of evidence: a direct comparison between brain activations and language model activations. Another piece of evidence is that, from the neurological level of what the brain is doing versus the parameter-space level of what SGD is doing, they clearly share at least some degree of their inductive biases - and plausibly most of their actually-relevant inductive biases. In particular, there are two sources of these inductive biases that seem most comparable between the brain and deep learning. One is an inductive bias towards flat regions in the parameter space. This might be getting kind of technical, so stop me if things seem like they deserve clarification.

Sure. But by "flat regions" you roughly mean settings of the knobs that specify how your brain works, or how a neural network works, such that there are a bunch of knobs you can twiddle - you can move all of them a fair bit without changing the loss, the thing that the system is optimized for. Is that roughly right? And the loss stays good?

Yeah - where those knobs are specifically the weights of a neural network, and the synaptic connections in the brain. So we're not talking about the genome or genes as the parameters being twiddled; it's the learned parameters of the learned artifact. So, people have done a bunch of investigation of the inductive biases of SGD, of deep learning - and by "inductive biases" I mean those features of the learning process that bias it towards certain types of solutions and certain patterns of generalization, as opposed to others. And there's one reasonably large source of inductive biases that the brain and deep learning share to quite a degree, which derives from the presence of noise, of randomness, in their two optimization procedures. Stochastic gradient descent - SGD - has quite a bit of noise in it, as a result of things like mini-batch ordering, not being trained on the entire dataset at once, and other factors related to how the data interacts with the architecture and optimizer. People have studied the impact of this noise on the types of solutions SGD tends to arrive at, and - unsurprisingly, in my opinion at least - the first-order effect of this noise is to bias SGD towards flatter regions of the loss landscape. In your knob view: if you have a bunch of knobs whose twiddling causes huge issues for the system in question, then having noise in which knobs get twiddled will cause issues whenever you're in a region of solution space where there are lots of ways to mess things up. So having randomness in the training process biases you towards regions of the solution space where there aren't lots of ways to mess things up - where lots of the alternative nearby settings of the knobs produce very similar behaviors to what you currently have. And the brain also has lots of noise in its learning process. The per-neuron level of noise in activation patterns is pretty high - much higher than dropout or the other regularizers we tend to apply to machine learning systems. Well, I guess that depends on how much noise you introduce into a machine learning system - you can introduce more noise than the brain has - but the brain has a respectable amount of noise, is my point.
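[A toy illustration of the measurement behind the "noise biases you towards flat minima" claim: train the same small network with noisy (small-batch) versus full-batch updates, then score "sharpness" as the average loss increase under random weight perturbations. This is a sketch of the measurement, not a reproduction of any particular study; whether this toy shows a large gap depends on hyperparameters.]

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import log_loss
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

def sharpness(net, X, y, sigma=0.05, trials=20):
    """Mean increase in training loss under random weight perturbations."""
    rng = np.random.default_rng(0)
    base = log_loss(y, net.predict_proba(X))
    saved = [(c.copy(), b.copy()) for c, b in zip(net.coefs_, net.intercepts_)]
    increases = []
    for _ in range(trials):
        for c, b in zip(net.coefs_, net.intercepts_):      # twiddle the knobs
            c += sigma * rng.normal(size=c.shape)
            b += sigma * rng.normal(size=b.shape)
        increases.append(log_loss(y, net.predict_proba(X)) - base)
        for (c0, b0), c, b in zip(saved, net.coefs_, net.intercepts_):
            c[:] = c0                                      # restore the knobs
            b[:] = b0
    return np.mean(increases)

for batch in [8, 400]:  # small batches = noisier gradient estimates
    net = MLPClassifier(hidden_layer_sizes=(32,), batch_size=batch,
                        max_iter=2000, random_state=0).fit(X, y)
    print(f"batch_size={batch}: sharpness ~ {sharpness(net, X, y):.3f}")
```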
Okay. So how do we know... how do you know which stuff to interpret as noise, versus signal that you just don't understand? Maybe you don't know this off the top of your head. I guess I've always heard that the brain is a noisy instrument, but I don't actually know the empirical observations that lead people to believe this. I guess you could look at temperature - the brain has a certain temperature, and at a certain scale, things will be jiggling around kind of randomly, so that gets you some floor.

Yeah, potentially. There's also the theoretical computational-efficiency reason, where Landauer's principle, or whatever, tells you the thermodynamic minimum energy required to do computations. If I remember correctly, there's a term in there for the accuracy - the noise - of the computations, and so just from a Pareto-optimum sort of view, you wouldn't expect biology to be extreme on maximizing precision, not compromising at all on the precision of individual computations, even if that costs more energy than otherwise - especially when deep learning shows you that you can have a fair degree of noise at the level of individual computations and still have an effective system in aggregate.

Yeah. And then there are probably arguments based on the biology of neurons, and how consistent they can plausibly be.

Sure.

But I'm not a neurobiologist.

Fair enough. So it sounded like the reason you thought this story was accurate - firstly, you referred people to a series of posts arguing for something like this perspective, and then you're pointing towards: okay, we train neural nets in roughly this way, and there are various similarities between human brains and neural networks - they have similar representations of language, the learning processes are biased towards similar kinds of results. Does that about cover it?

Oh, that was one of the ways in which the inductive biases are similar. The other big way I think they're similar: there's this thing called singular learning theory in the study of machine learning systems, and basically what singular learning theory does is look at how the data that a system is trained on influences the inductive biases of the system. There are even results showing that no purely architecture-level explanation of inductive biases - one that doesn't include facts about the data the system was actually trained on - can fully account for the inductive biases of the system. And basically, singular learning theory argues that there are certain facts about the geometry of the data that the system is trained on, and how that interacts with the possible internals of the system, which will bias the system towards certain types of internal representations. From my perspective, or as far as I can tell, very similar sorts of arguments should apply to the neurology - the synapse space - of human brains, because singular learning theory doesn't make any assumptions about the sort of optimizer you use to tune the internals of the system. It's basically a statistical-thermodynamics-style argument about the number of configuration states that the internals of a system can have which correspond to certain types of external behavior. Basically: the more possible internal configurations a system has which produce functionally equivalent behavior, the more likely you are to hit those sorts of internal configurations - if that makes sense.

Yeah, that roughly makes sense.

And so it seems like there should be a similar sort of dynamic here, where your brain's synapses are constantly changing a little bit over time in a stochastic manner, and so you should end up in regions of brain-synapse configuration space that have a large volume of configurations implementing functionally very similar behavior - since your values and behaviors don't change wildly while interacting with minimal amounts of data over your lifetime.

Okay, yeah.

And there's this thing called the parameter-function map in deep learning, which, for all the knobs defining the internal structure of your system, tells you how those knobs relate to the external behavior of the system overall. And singular learning theory basically says that the geometry of this parameter-function map - the volume of the regions of parameter space that correspond to similar behaviors - determines the inductive biases: what sorts of internal configurations you tend to arrive at, and how those configurations tend to generalize.
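[The "volume" argument can be made concrete with a tiny experiment: sample random parameter settings of a hypothetical small network, record the input-output behavior each setting implements, and count how often each behavior shows up. Behaviors occupying more parameter volume are sampled more often. The architecture and numbers below are made up for illustration; this mirrors the flavor of the argument, not any specific paper's setup.]

```python
import numpy as np
from collections import Counter
from itertools import product

rng = np.random.default_rng(0)
inputs = np.array(list(product([0.0, 1.0], repeat=3)))  # all 8 binary inputs

def random_net_behavior():
    """Truth table induced by one random parameter setting of a tiny MLP."""
    W1 = rng.normal(size=(3, 8)); b1 = rng.normal(size=8)
    W2 = rng.normal(size=(8, 1)); b2 = rng.normal(size=1)
    h = np.maximum(inputs @ W1 + b1, 0.0)                # ReLU hidden layer
    out = (h @ W2 + b2 > 0).astype(int).ravel()
    return "".join(map(str, out))                        # 8-bit behavior string

counts = Counter(random_net_behavior() for _ in range(100_000))
print("most common behaviors:", counts.most_common(5))
print("distinct behaviors found:", len(counts), "of 256 possible")
```

Typically the simplest behaviors (e.g. the constant functions) dominate the histogram: many more knob settings implement them, so random draws - or a noisy search process - land on them more often.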
And the interesting thing about the parameter-function map in deep learning systems is that we can get a local approximation to this map - it's called a linearized approximation to a neural network, the neural tangent kernel, if you've heard of that. So instead of having the full, complicated landscape of how every possible parameter change influences the network's downstream behavior, you can look at a particular location in this map and take a first-order linear approximation to that relationship. It turns out this tells you the inductive biases of the network at that particular location, because it tells you, for every possible direction you could move in the neural network's parameters, how that influences the overall function that the network as a whole implements - at least in that local region of the parameter space. People have studied how this linearized approximation to the parameter-function map evolves over time as you train the neural network, and it turns out that the linearized approximation tends to adapt to match the target function of the neural network in question.

The target function?

Yeah - the labeling of the data that the neural network is trained on. So there was this paper that did a sort of toy example of this: they took a deep neural network and trained it on points in two-dimensional space, doing binary classification of these 2D points. They were just determining whether or not the points were within a circle of radius one - points within the circle were positively labeled, and points outside were negatively labeled. They trained the neural network on these data, and then looked at how this changed the local parameter-function map - the local tangent kernel - of the network. And what you see is the formation of a circle inside the parameter-function map.

What do you mean, a circle inside the parameter-function map?

So the parameter-function map is this thing that's telling you how the function of the neural network changes as you change the parameters, right? When I say they found a circle inside the parameter-function map of the learned neural network, what this means is that the neural network reached a point in its parameter space where it was easier to change its classification boundary in a circular manner. So for points at the edge of the classification boundary, the neural network didn't just make a single local bump to include or exclude the point depending on its label - it actually updated the whole radius that the network, quote-unquote, "thought" the circle had.

So the function corresponding to these parameters is basically classifying depending on whether or not you're in a circle, and in this linearized approximation, any of the ways you could change the network would roughly still have it be a circle - you would just change the radius, or maybe where the circle was located, or something like that?

Entirely the radius, as far as I recall.

Okay.

Yeah - so it's not that all of the ways you could change it would change the radius; it's just that most of the local directions you can move from the current location of the network were more inclined to change the radius of the circle, as opposed to, say, creating a very localized exception for the particular point being classified.
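[A loose empirical cousin of that experiment - not the paper's NTK analysis - can be run in a few lines: train a small classifier on the circle task, nudge its weights at random, and check whether the decision boundary stays roughly circular with a shifted radius. All hyperparameters below are arbitrary choices for illustration.]

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(2000, 2))
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)    # inside unit circle = positive
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=3000, random_state=0).fit(X, y)

def boundary_radii(net, n_dirs=100):
    """Radius at which the predicted class flips, along random directions."""
    radii = []
    for theta in rng.uniform(0, 2 * np.pi, size=n_dirs):
        rs = np.linspace(0.01, 2.0, 400)
        pts = np.stack([rs * np.cos(theta), rs * np.sin(theta)], axis=1)
        preds = net.predict(pts)
        # First radius classified "outside" (fall back to max radius if none).
        radii.append(rs[np.argmax(preds == 0)] if (preds == 0).any() else 2.0)
    return np.array(radii)

r0 = boundary_radii(net)
print(f"unperturbed boundary radius: {r0.mean():.2f} +/- {r0.std():.2f}")

# Nudge every weight a little and see whether the boundary stays circular.
for c in net.coefs_:
    c += 0.03 * rng.normal(size=c.shape)
r1 = boundary_radii(net)
print(f"perturbed boundary radius:   {r1.mean():.2f} +/- {r1.std():.2f}")
```

If the claim holds in this toy, the perturbed boundary's radius shifts while its spread across directions stays small - i.e. nearby knob settings mostly still "think in circles".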
Sorry - what does this have to do with how neural networks are similar to human brains?

So, singular learning theory is saying that the geometry of this parameter-function map determines the inductive biases of the network, and it turns out that the parameter-function map adapts to match the loss function - the data - that you actually train on. And so this is telling you that the inductive biases of the network can't be that strongly determined by facts about the architecture, optimizer, and so on; they're to a very great degree determined by the data that you train it on. And the way this relates to brains is that, in my opinion at least, if we had the interpretability tools to look at the parameter-function map of brains, we'd see a similar sort of alignment with the loss function - the data-labeling function - of the brain, where changes in brain parameters are statistically more likely to change your beliefs about the world in a quote-unquote "reasonable" sense: one that matches the structures that are actually in the world, and how those structures manifest themselves to you via the loss functions that your brain is trained on. Does that make sense?

Yeah, I think that basically makes sense. So: data distribution determines the parameter-function map of deep networks, which determines their inductive biases; and the data distribution probably also determines the parameter-function map of neural synapses, which determines their inductive biases.

So it seems to me that the inductive biases of these things are to a very great extent determined by the data. And this doesn't by itself guarantee convergence, because of course the brain and neural networks are trained on different types of data - the brain has a bunch of internal loss functions and labelings of all sorts of wild stuff, and also, we don't give children academic papers and tell them to predict the next word.

Yeah. In fact, to the degree that you think humans and deep neural networks are trained on different types of data, it sounds like, if you believe singular learning theory, you should predict that they're actually quite different.

Some parts are quite different. The most obvious one is that language models aren't image models, and aren't trained on images. But also, there are these eerie regularities between different types of data. For example, it turns out that language models can do image classification - some of the craziest stuff I've ever seen is that if you tokenize an image as language tokens, GPT-3 can few-shot learn to classify images. People have also compared the internal representational structure of pure language models and pure image models, and found convergently similar geometries in certain respects. And there's transfer learning between image and language pre-training. And - this is a pretty big thing as far as I'm concerned - blind humans are not morally different in their values, to much degree at all, compared to sighted humans, despite two facts. One: these are substantial divergences in the training data.
And two: evolution probably did not adapt to the case of blindness - most blind humans probably died in the ancestral environment, so evolution probably did not tune the human learning process to do anything in particular for blind people. So the fact that blind people and sighted people are pretty much the same, as far as any relevant question of morality or values goes, is, I think, evidence that it's actually kind of weirdly easy to learn morality, even from quite varying types of data. Similarly, people's brains are actually quite different from each other in many respects. There are people who have had a single hemisphere of their brain from shortly after they were born, and behaviorally they're pretty much identical to bi-hemispheric individuals - and of course, evolution could have had no opinion on that at all, so to speak. So this feeds into my belief that human values are probably pretty robust to this sort of learning process that they're being fed into.

Okay. So now that we've covered the background assumptions about how the human brain works, I'd like to talk a little more about what Shard Theory actually is - in particular, the shard theory of human values. Can you say what it is?

Yeah. It's basically an accounting of how pretty simple, RL-esque learning processes can produce things that at least look quite a bit like human values. It's intended to be an alternative perspective on what values are and how they arise, in contrast to, say, expected utility theory - sort of intended to be an accounting of values for deep learning systems, so to speak.

Okay, and what's the accounting? What is the theory?

Basically: the reward that you train a deep system on is not its values - it's a chisel that shapes those values. And the values, insofar as they're actually a thing at all, are very much contextual. They are "contextually activated decision influences" - that's the phrase we often use. So there are these almost lookup-table-esque factors - little classifiers, you could think of them as - in your mind, that look for particular situations that you're in, accounting for low-level factors about your immediate environment, say, as well as higher-level facts about what you're thinking about right now - reflective quantities in your cognition. They activate for certain types of scenarios, and introduce decision-relevant factors - biases, you could say - into your cognition, which steer the sorts of plans or behaviors you output, in ways that reflect the distribution of rewards you've encountered so far in your life, as well as the actions you took preceding those rewards - but which are not themselves oriented towards maximizing reward. If that makes any sense at all.

Yeah. So in various contexts - where "context" can be low-level perceptual stuff, or high-level "what I'm thinking about" - there are various different influences on my behavior, and that's basically what you think of as values. Is that right?

Yeah, yeah.
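[The "contextually activated decision influences" framing lends itself to a toy data structure. The sketch below is purely illustrative - the shard names, trigger conditions, and weights are invented, and nothing in Shard Theory claims shards literally sum scores like this; the point is only to show "context gates which influences get a vote".]

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Shard:
    name: str
    trigger: Callable[[Dict], float]   # how strongly this context activates the shard
    bias: Dict[str, float]             # influence exerted on candidate actions

def decide(shards: List[Shard], context: Dict, actions: List[str]) -> str:
    """Each active shard nudges the action scores; the most-nudged action wins."""
    scores = {a: 0.0 for a in actions}
    for s in shards:
        w = s.trigger(context)
        for a in actions:
            scores[a] += w * s.bias.get(a, 0.0)
    return max(scores, key=scores.get)

# Hypothetical shards, echoing the snack example discussed later in the episode.
shards = [
    Shard("sugar", lambda c: 1.0 if c.get("snack_visible") else 0.1,
          {"eat_snack": 2.0}),
    Shard("fitness", lambda c: 1.0 if c.get("reflecting") else 0.3,
          {"eat_snack": -1.5, "hide_snack": 1.0}),
]

print(decide(shards, {"snack_visible": True}, ["eat_snack", "hide_snack"]))  # eat_snack
print(decide(shards, {"reflecting": True}, ["eat_snack", "hide_snack"]))     # hide_snack
```

Same agent, same shards, opposite decisions - purely because the context changed which influences were strongly activated.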
And computationally speaking, these things aren't stored as, say, an exact discriminator that perfectly reflects the difference between good and bad outcomes in the real world. They're more like a continuous, online process of shaping the thoughts that you actually do have in certain directions, and promoting certain types of thoughts above other types of thoughts.

Okay. So they're conceptually a very different thing from, say, a utility function implemented as a deep learning classification model over outcomes, or anything like that - because the thing they have influence over is what I think about, what cognition I do?

Yeah - that, and also they're less coherent, I guess you could say. They're very contextual: not globally activated, consistent criteria for judging the worthwhileness of different outcomes. And they're kind of not very good - not very robust - as classifiers. It's not like humans are extensively adversarially trained to perfectly implement a given set of reflectively endorsed values. So it's conceptually more "by the seat of your pants", so to speak, than a clean, mechanistic decision-making process. Well, it is mechanistic, but it's not clean, I guess you could say.

Okay. So back closer to the start of the episode, we said that the things we were supposed to be trying to explain were, firstly, something like: when I come up with ideas for what to do, which ideas appear to me, which things seem like plausible candidates - in particular, the value-laden part of that, which is something like things that I endorse: plans that I endorse prioritizing, or plans that society endorses me prioritizing. And there was another, discriminative part -

Yeah.

- some sort of discriminative thing: when I see different plans, which ones do I pick? So these contextual influences on decision-making - are those called "shards" in Shard Theory? Do I understand that right?

Yeah.

So how do shards... is the idea that they all constitute or shape these generative and discriminative values? How do they relate?

Well, if I remember correctly, you were asking me about the general shape of human values, when I described what I see as two of the biggest clusters of types of values, and how they make themselves known in your decision-making process. And shards are these contextually activated computations that eventually output some factor which influences your decisions. So they can be active in either context - as a generator or a discriminator.

Do you mean that any given shard can be active in either context, or that shards in general can be active in either context?

I think having an opinion on that would be overstepping my epistemic grounding. It would be confabulating a framework where there isn't evidence to support it.

When you said the shards can influence either, what did you mean?

So, you could view this as: there are some shards that specialize towards activating in the discriminative versus the generative context; or you could say there are different versions of the same underlying shard; or you could say shards are cleanly divided between those two things; or whatever. My point about this being overstepping my epistemic boundaries is that this would be speculating where there isn't a basis for preferring one framework over the others. So I guess I'd prefer to talk about more concrete
sorts of things that we clearly do implement as part of our decision-making process - like philosophical reflection versus food preferences, say. So, say you have a number of foods available to you, and you just scan through the list of food items and choose one; some foods are just intuitively more appealing than others. That seems like a relatively low-level, generative sort of thing to me. On the other hand, there are high-level philosophical reflection processes, where you're thinking about the far-downstream outcomes of adopting a particular philosophy, say, or asking yourself whether thinking in this style is going to make you the sort of person you want to be in the future. And it seems to me that these are - or the way I think of these is as - different conditionals which are activating different populations, different distributions, of shards. The first population of shards is going to be more inclined towards quicker, shallower, lower-level, more sensorially grounded decision influences, and the second, philosophical population of shards is going to be higher-level, more introspective, more "thinking about the reasons you're thinking stuff", and that sort of thing. It's very unlikely that you'll have a decision influence on your food preferences whose underlying computation is asking anything like "do I want to be the sort of person who likes apples?". I presume there are people who do think like that, but most people don't.

Yeah. I mean, this actually gets to something that seems a bit strange about Shard Theory to me. I'm a vegan, as it happens, and it actually is true that my food preferences are shaped by thought that is at least more abstract than "is this food tasty?" - I'm not a vegan because I don't think meat is tasty, right? And, I don't know, I think people go on diets, or... people do sometimes think, "I don't want to have the Oreo near me, because if I do, then I'll really want to eat Oreos - but I don't want to be eating Oreos; I want my body to be a temple", or something.

Yeah - so these are different contexts in which different shards activate. If you had some fixed utility assigned to Oreo-eating and some fixed utility assigned to fitness, then you'd strike the expected-utility balance between these things, and you wouldn't play these weird games with yourself of hiding the Oreo - because why would you deprive yourself of information? It just seems pointless from the expected utility perspective. But if you think of shards as being contextually activated - and, in particular, of those contexts as being shaped by the local reinforcement learning process -

Yeah.

- then it makes absolute sense that, say, the shards that activate in a more abstract context could prefer futures which are in tension with the shards that activate when an Oreo is actually in your visual field. So these more abstract, more broadly activated shards are sort of doing a run-around on the visual stimuli that would cause their competitors - their ideological opponents, so to speak - to suddenly have weight in your decision-making algorithm. And this is exactly the sort of thing you would expect to happen if there were parts of you that have different preferences, but are situationally activated in predictable sorts of
ways: you'd have some parts of you trying to control the situation, in order to influence which other parts activate.

I guess I was responding to the thing you said about how high-level philosophical reflection doesn't influence food choices, or something like that - because that actually is just true for me. I really do sometimes see menus with meat on them and think, "okay, I'm not going to pick that". At least, my story is that I decide I don't want to eat that because I'm vegan.

Yeah. My understanding is that you're pretty unusual in this regard - no offense - but it's not an impossible thing to happen. I think part of the story here - and I'm not certain about this - is that we tend to have a learned preference for symmetry in our belief systems, and symmetries in moral belief systems tend to suggest applying consistent principles across moral patienthood. So, depending on exactly how you conceive of moral patienthood, and the appropriate aesthetics for your internal cognition - the symmetries appropriate for your internal beliefs - you can end up in situations where you're applying some sort of generalized moral preference to things that don't resemble you across physical dimensions, such as fish and animals.

Okay, yeah. I think this gets to another question I had about Shard Theory, which is that I think it's meant to evoke the idea of people as almost kind of reflexive, right? All these influences on behavior, or on cognition, or whatever, are not that much more complicated than "if I see this, then try this thing" - but you can chain a lot of those together to get something pretty sophisticated. I mean, we know that humans do very sophisticated things, right? I have a laptop in front of me; it was built by humans - or, not by one human, sorry, it was built by a bunch of humans - and they had to do very particular things. So this idea that we could have this internal preference for symmetry, and that causes things to kind of cohere - I'm wondering, how far do you think that goes? Do you think that limits the relevance of the shard-y story, relative to some more abstract expected-utility-ish story?

So, I think that for most systems you could construct out of deep learning components, you're still going to be in the Shard Theory domain. I think part of this perception is that during the Shard Theory sequence, we very much did focus on simple examples of externally clear behaviors that have straightforward representations in a simple reward function, such as juice consumption. But as I've mentioned a few times in this podcast, you can have contextually activated decision influences where the context in question is your own mental state and the way in which you're currently thinking, and the decision over which the influence is being exerted is how you are next going to think. So you can have shards which activate when you're, say, thinking in a rude way, and push your thoughts towards not being rude, without this necessarily corresponding to any externally visible reward.

Okay.

And one of the things that makes humans kind of annoying to study is that they have these entirely internally defined
reward functions, which are not only internal but also learned within the lifetime. This is the thing that Steve Byrnes calls "RL on thoughts": there are parts of your brain that are assessing how you're thinking, and assigning reward based entirely on the thoughts, and not on the externally observable actions. So I do think that these meta-level cognitive processes are built on the workhorse of self-supervised and reinforcement learning, in a pretty deep-learning-compatible way - out of loss functions that are kind of weird, and not easily externally accessible.

When you say "loss functions", you mean the learned loss functions in your head that are guiding thoughts?

Yeah - learned thought-assessors. So, one thing you can do in deep learning, with actual language models, is make a sentiment classifier out of a language model. You can take an autoregressive language model - a GPT - attach a classifier to the hidden states of that GPT, make the classifier a sentiment classifier, and then propagate the gradients of this sentiment classifier into the hidden-state representations of the GPT model. If you optimize the hidden states of the GPT model to be classified as having positive sentiment by the classifier, then your GPT model will start talking in a positive-sentiment manner. So it's possible to use learned classifiers, essentially, as parts of loss functions over internal cognitive states - that's a thing that's consistent with the toolbox of deep learning.
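[A minimal sketch of the hidden-state steering setup just described, loosely in the spirit of "Plug and Play Language Models", assuming the HuggingFace transformers library. The sentiment head below is randomly initialized as a stand-in - the setup Quintin describes trains one on labelled examples - so the actual output shift here is in an arbitrary direction; the mechanics of "optimize hidden states against a classifier" are the point.]

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Stand-in "learned thought-assessor": a linear head over the LM's hidden states.
sentiment_head = torch.nn.Linear(model.config.n_embd, 1)

ids = tok("The movie was", return_tensors="pt").input_ids
with torch.no_grad():
    h = model.transformer(ids).last_hidden_state   # (1, seq_len, n_embd)

# Optimize a perturbation of the hidden states to please the classifier...
delta = torch.zeros_like(h, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.03)
for _ in range(20):
    opt.zero_grad()
    score = sentiment_head((h + delta)[:, -1]).squeeze()
    (-score).backward()        # gradient ascent on the "sentiment" score
    opt.step()

# ...then read the next-token distribution off the steered hidden states.
with torch.no_grad():
    logits = model.lm_head(h + delta)
print(tok.decode([logits[0, -1].argmax().item()]))
```

Note that the classifier's score here is a tool for shaping the model's "cognition", not a quantity anyone directly maximizes over outputs - which foreshadows the robustness point that comes next.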
Okay. I guess at that point... you mentioned earlier that this was meant to be in contrast with an expected utility picture - a picture where things are optimally pursuing a certain objective. But it seems like once I've got learned cognitive loss functions that are organizing my thoughts, steering me towards some directions and away from others, and these shards are interacting in some complicated way, presumably they're going to learn to preserve relevant resources, of the kind where you can basically satisfy the assumptions of those coherence theorems that say you should be an expected utility maximizer. So I wonder - it seems almost like, in this story, at some point you could just be an expected-utility-optimal thing, or very close to one. Do you think that's right?

Yeah, so I used to think this - that the shards would do all the trades that would be beneficial, and also sign all the insurance contracts that would be beneficial. In principle you might be able to do this, but it seems like an extremely expensive thing to do, computationally speaking. Since shards are contextual, and your future distribution over contexts is uncertain - how do you weight the shards appropriately? How do you poll all the shards? You have to predict all the future contexts you'll be in, and then iterate over what's probably an exponentially large space of possible inputs. If you're wrong about your future contexts, then you're mis-weighting things. This is probably a computationally hard problem, in the sense of requiring exponentially large amounts of compute.

Also, a thing to keep in mind about the internally learned classifiers is that they're not very robust. In the specific example of the sentiment classifier on top of a deep learning system, that sentiment classifier was only trained with 200 positive and negative sentiment examples, in the paper I'm specifically thinking about. If you enumerated all the possible ways you could write a positive sentence, scored them on sentiment, and chose the most positively scoring one, you'd probably get nonsense - you'd get an adversarial example to the classifier. Especially since the classifier is actually operating over the hidden-state representations of the language model: you'd get an adversarial internal representation to that classifier, if you tried to view the classifier's score as a utility function and then sought to maximize it.

And in general, I think that expected utility is not the correct sort of description to use for values. One of the classic objections to expected utility theory is that anything can be viewed as maximizing a utility function, right? Rocks, bricks - your brain is maximizing its compliance with the laws of physics, whatever. And sometimes people argue that you should have, say, a simplicity prior over your utility functions - that what we want is some simple utility function that matches our preferences, or something like that. But you can also get that out of a deep learning system. One thing you can do with deep learning systems is turn them from being whatever a neural network is on the inside into, essentially, a pure inner optimizer. You can replace a neural network with an optimizer with an explicit objective function, whose external behaviors are exactly identical to the neural network in question. This is a trick that machine learning researchers sometimes use in order to analyze neural network architectures - these are called energy functions of the neural network. Once you derive the right energy function, one that respects the weights of the neural network, then the outputs of a forward pass can be derived by doing an argmin over the energy function of the network. The thing is, though, that this energy function isn't defined over world-model stuff. You can look at the energy function and have no idea about the network's "preferences" - it would be no more informative than the weights of the neural network itself.

Yep.

So this is a counterexample to the view of inner objectives as necessarily relating to the thing we intend to reference when we say "goals" and "desires" and "values". And yeah, I understand that expected utility advocates would say, "well, the utility function is defined over the world model, which is not what the energy function is defined over". But just seeing that there's this thing which is mathematically so very close to a utility function, yet totally lacking in the sorts of intuitively comprehensible, value-related aspects that one would hope a utility function to have - it makes me skeptical that utility functions are the right mathematical ontology for talking about values.
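[A small, self-contained instance of the "forward pass as argmin of an uninformative objective" idea - simpler than the energy-function constructions Quintin references, but the same flavor. The identity used: for a linear-plus-ReLU layer, relu(Wx) is exactly the minimizer over y >= 0 of the "energy" E(y) = 1/2 ||y||^2 - y . (Wx). Minimizing E reproduces the layer's output, yet E says nothing recognizable about "goals".]

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
a = W @ x

def energy(y):
    # E(y) = 1/2 ||y||^2 - y . (Wx); minimizing over y >= 0 recovers relu(Wx).
    return 0.5 * y @ y - y @ a

res = minimize(energy, x0=np.zeros(4), bounds=[(0, None)] * 4)
print("argmin of energy:", np.round(res.x, 4))
print("relu(Wx):        ", np.round(np.maximum(a, 0), 4))
```

The two printed vectors match: the layer "optimizes an objective" in a perfectly literal sense, and that fact tells you nothing about what the network wants.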
Okay. So it seems like your objection is something like: look, if you're talking about these learned loss functions in your brain that are driving high-level brain behavior, then even if they are something like utility functions, that tells you almost nothing about the behavior of the brain - because in general, we can take basically any system and cast it as a utility-optimizing thing, and therefore just knowing that it's "utility-optimizing" doesn't constrain our expectations at all. Is that roughly what you're saying?

Yeah - there's that, and then there's the relationship between the learned classifiers in your brain and your actual values. The scores of those learned classifiers are not utility. You don't want to optimize those scores: you don't want to do brain surgery on yourself that makes you extremely convinced everything's going fine in the world, right? Your values don't orient towards maximization of this internal score.

Yep.

The score is a tool to shape your behavior, but not the thing that your values are oriented towards.

And I guess it only shapes your behavior in certain contexts - not, I guess, in the context of doing neurosurgery. It doesn't shape it in the way that would get you to maximize the firing rate of whatever those neurons are.

Yeah. And then trying to convert things into an informative utility function would, I think, be super hard, because of this question of what activates in what context - how to predict the contexts and weight things correctly.

I guess my question was a little bit different. It seems like, under the Shard Theory point of view, I've got all these shards which are somehow simple, but they can combine in really complicated ways, right? So a naive person might have said: "oh hey, here's what humans are doing - they're optimizing the amount of cash they have, in a utility-optimal way". And when I first hear about Shard Theory, I might say, "oh, you can't do that, because all you have is these very simple contextual influences on your behavior". But then I might think: well, the contextual influences on your behavior can combine in really complicated ways - you can have these meta loss functions, these learned loss functions that shape your cognition - and then it no longer seems so implausible that I could be doing cognition that keeps track of, you know, not losing gold coins in expectation, that keeps track of all these things, and that ends up with me being a utility optimizer over that. I think one objection you raised to this is computational efficiency - maybe that was only in the context of internal utility functions, but you could raise it here as well. I'm wondering if there are any other constraints you think Shard Theory places that rule out this kind of story, where these simple things interact in very complicated ways.

Why would you think that the reflectively endorsed product of their interaction should shake out into this simple utility function over the expected number of gold coins? Even if you could derive a utility function, and even if it were computationally worthwhile to bother, it would be extremely complicated. It would not be money-maximization, because you don't want to turn the entire planet into money.

Yep.

People tend to talk about utility functions using very simple examples - apples versus pears, or money
versus poverty, or whatever. But when you take a very complicated artifact, such as the human brain, and extract a utility function over world models such that that system implements the behaviors of the brain, you should expect this utility function to be extremely complicated - especially if it's defined over all possible future configurations of the universe, or trajectories of the universe's evolution, which is quite a lot to ask.

Okay. So is the thing that the Shard Theory perspective is telling me roughly that human behavior is going to be kludge-like, and it's not going to be predicted by some really simple theory? Or what's the content of what it's actually saying about human values?

They tend to be more "ensembly" than cleanly expressible as simple utility functions.

"Ensembly" - like an ensemble?

Yeah. Coalitions. Think more like negotiated coalitions than single, final objectives. Or - I'm not exactly sure what you're asking here.

I guess I'm wondering: once I accept that there are a bunch of contextual influences on decision-making, what does that say about human values? Maybe it says they're really complicated; maybe it says something else. What I'm asking is: what does it say about human values?

You're talking about the values themselves, as opposed to the process that creates them, right?

Sure - or maybe the content is, "no, the thing it says is about the process that creates the human values". That would be a fine answer.

So, the biggest update that I take away from Shard Theory is that very complicated values can derive from very simple underlying value-formation processes. As well as - as I mentioned at the start of the podcast - the convergence between the training processes that we've discovered for deep learning and what the brain actually does, how the genome configures the brain's learning process. The fact that value-formation processes can be simple, and the fact that we have this weird level of convergence between our value-formation process and deep learning value-formation processes, makes me a lot more optimistic about the prospects of getting good values out of a value-formation process that we can define in machine systems.

It also means that I don't think OpenAI's efforts to align GPTs are, quote-unquote, "fake alignment research". I get the impression that people often think that getting language models to say the right sort of stuff is not "real" alignment research, and I totally disagree with this. The behavioral shaping process that OpenAI uses with GPT models is, as I said, quite similar to the process that I think underlies human value formation. Of course, there are higher-level meta aspects that are currently not a part of mainstream AI training - at least, not yet - but I think they're very much on the right track for imitating at least some of the low-level portions of human value formation processes in silico. And I think that, as Sam Altman says, getting real-world experience - hands-on experience with how our in-silico replications of behavioral shaping, slash value formation, slash whatever you want to call it, actually work out - is quite good. I also expect that the sorts of infrastructural
investment that OpenAI has made to get RLHF working on their GPT models is quite a good thing - scalable feedback from humans, scalable oversight of models by other models, and so on and so forth.

Okay. So that's from the perspective of what process is responsible for human values forming, and how that relates to alignment, at least at a high level. In terms of the values themselves, it's harder to talk about those, because they're so diverse - as we discussed with your veganism, different people have different sorts of values depending on their quote-unquote "within-lifetime training data". But generally, I think it tells us that values tend to be broader and more situational than you might expect if you were thinking that things would necessarily converge to a utility function.

How do you mean, "broader"?

There's no simple function of physical matter configurations that, if tiled across the entire universe, would fully satisfy the values.

So "broad" in the sense of "complex"?

Yeah - not narrow and simple. Though it's more than just complex: it's not that they're very complex patterns that are just pointless to enumerate. It's that we tend to value lots of different stuff, and we sort of asymptotically run out of caring about things when there's lots of them.

Decreasing marginal value, I guess, for any particular fixed pattern. And you think we get that from Shard Theory?

From a combination of Shard Theory plus the intuitions from current deep learning. GPTs arguably have broader "values" than even humans do, in the sense that, depending on their context - which is to say, their prompt - they can act out the values of whatever. And deep learning generally - of course, we don't have mechanistic-interpretability-level understanding of what happens in deep learning systems, but they seem very broad and "ensembly", as I said previously. People have taken deep models and permuted their layers - take layer 20 and switch it with layer 21, or layer 18, or whatever - and seen what this does, and, weirdly, it doesn't screw things up that badly. Permuting the locations of adjacent layers in a transformer usually doesn't do too much. It's worse when you permute the first layer with the second layer, or the last layer with the penultimate layer, but most of the middle layers are pretty robust to this. So you can think about what sort of internal computations they could be doing - what exactly a layer is doing - that would lead to this being a property of them. In my mind at least, the thing that would make sense of this, and of other behavioral properties of how deep learning models' internal representations seem to work, is that they'd be pseudo-independent portions of an ensemble: the model seems to be internally ensembling among a very large number of pathways through itself. There's even a paper - "Residual networks behave like ensembles of relatively shallow networks" - showing that the computational pathways that most contribute to a given ResNet's behavior are systematically biased towards shallower pathways than you'd expect from just counting arguments about the number of pathways through the model.
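[The layer-permutation observation is easy to poke at directly, assuming the HuggingFace transformers library and GPT-2. The sketch below swaps two transformer blocks and measures the language-modelling loss on one sentence; the specific layer indices, text, and numbers are illustrative, and exact results will vary.]

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
text = "The quick brown fox jumps over the lazy dog because it is"
ids = tok(text, return_tensors="pt").input_ids

def loss_with_swap(i, j):
    # Reload fresh weights each time so swaps don't accumulate.
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    blocks = model.transformer.h
    blocks[i], blocks[j] = blocks[j], blocks[i]   # permute two transformer blocks
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

print(f"baseline loss:        {loss_with_swap(0, 0):.2f}")   # no-op swap
print(f"swap layers 6 and 7:  {loss_with_swap(6, 7):.2f}")   # middle: usually mild
print(f"swap layers 0 and 1:  {loss_with_swap(0, 1):.2f}")   # early: usually worse
```

If the pattern Quintin describes holds, the middle-layer swap degrades the loss much less than the early-layer swap - the behavior you'd expect from pseudo-independent ensemble members rather than one rigid pipeline.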
Similarly, another factor in this thinking is a paper called "Predicting Inductive Biases of Pre-Trained Models". This paper investigates a weird thing with pre-trained models: people will probe - examine - the internal representations of pre-trained language models for various concepts, and they'll find that the internal representations contain concepts corresponding to, say, sentence entailment - whether or not sentence A logically implies sentence B. That's a concept that will exist inside your pre-trained model. Then people will take that pre-trained model and fine-tune it into a classifier of whether sentence A entails sentence B, and something that will quasi-frequently happen, depending on exactly the fine-tuning data used, is that the model's internal representation of the genuine relationship is not what gets hooked up to the classification decisions. It will instead connect up shallow correlates of logical sentence entailment. For example, it might make its decisions based on whether or not the two sentences have significant lexical overlap - whether they share lots of words in common. It will then get perfect accuracy on the training data, and fail to generalize to out-of-distribution test data, despite the fact that the model itself appears to have a robust concept of the quantity in question. And it turns out you can predict whether a model will use its robust concept, or a shallow correlate of that concept, depending on the type of fine-tuning data you give it. What's interesting to me is that the model is simultaneously representing many different concepts of this quantity - a sort of feature-level ensembling over relevant predictors, occurring just as a result of pre-training. So I generally have this notion that deep learning defaults to breadth, in a way that would be counterintuitive if you were using common notions of how a utility maximizer would be structured.
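[A cartoon of the shortcut-learning phenomenon just described - not the paper's setup. Two features are available: a noisy "real entailment" signal, and a "lexical overlap" correlate that happens to track the label almost perfectly in the (biased) training set. A linear classifier latches onto the shortcut, then degrades when the correlation breaks out of distribution. Feature names and noise levels are invented for illustration.]

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)

real = y + 0.8 * rng.normal(size=n)            # noisy "genuine entailment" feature
overlap = y + 0.1 * rng.normal(size=n)         # shallow correlate, near-perfect on train

clf = LogisticRegression().fit(np.column_stack([real, overlap]), y)
print("learned weights [real, overlap]:", np.round(clf.coef_[0], 2))

# Out of distribution: lexical overlap no longer correlates with entailment.
y_test = rng.integers(0, 2, size=n)
X_test = np.column_stack([y_test + 0.8 * rng.normal(size=n),   # real signal persists
                          rng.normal(size=n) + 0.5])           # overlap decorrelated
print("OOD accuracy:", clf.score(X_test, y_test))
```

The classifier weights show it leaning on the overlap feature, and OOD accuracy drops accordingly, even though a usable "robust concept" (the real feature) was available the whole time.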
Okay, so I think this gets to the implications of Shard Theory for deep learning, and in particular the implications for alignment research - getting AIs to play nice with humans. Here's one takeaway you could have. We didn't get much into it here, but in the posts on Shard Theory, there's this account of how shards form, which is: you happen to be doing a thing, and that gets you some reward, and that forms a shard which says "yeah, do that thing in this context", right? So here's a takeaway I could have, that's sort of optimized for being different from your takeaway. Look: human values are very complex and very hard to predict, so their formation depends on the details of what things you learn, and what abstractions happen to be developed at what stages - because those abstractions shape your mental context at any given point in time. So basically, you get these shards, and it's hard to predict when they're going to come, or in what order. Furthermore, you're going to end up with something like mesa-optimization: once you've formed these shards, you're now going to do your own optimization - not for what your reward circuitry wanted, but for these goals you've learned during training. And alignment is going to be very difficult, because humans aren't optimizing for their reward circuitry - they're not even optimizing that hard for learned optimization functions - and in particular, it's kind of a mess, and it's very hard to steer. It seems like you think I'm wrong, and I'm wondering if you could say why you think that story is wrong - or you could say that you agree with it, if that happens to be true.

No, I think it's wrong, for a number of reasons. For one, you don't have to exactly reproduce your particular values in an AI system; you have to make the AI want to be a good AI in the future. That's the biggest core of ensuring alignment is stable. One of the things that deep learning lets you do is: if you can quantify an external behavior - if you can produce a loss function that activates when the behavior happens - then you can train on that loss function, and get the model to behave in that way.

Yep.

So you can say, "okay, model, we shall train you to be good", and insofar as you're able to quantify or evaluate good behavior, or demonstrate it to the model, you can get it to do that on the training distribution. So I think the tricky and worrisome part isn't "get the behavior that you can see, or judge during training, into the model"; it's "get the model to want to do that in the future" - to want to retain stable alignment in the future. And this is, I think, from a complexity perspective, way, way simpler than all the behaviors, all the values, that you could want to specify for any sort of cognitive system. And as it turns out, deep learning has an inductive bias towards simple behavior. In particular, it has an inductive bias towards what is called "low-frequency" behavior in the machine learning literature - low-frequency learned functions. This means a function that does not change much as its inputs change, whose input-output relationship doesn't vary much over the domain of possible inputs. And it seems to me that the very abstract notion of wanting to be good in the future is actually pretty low-frequency. You might not be sure what exactly "good" means - that's a complicated output, it's complicated to derive in a particular circumstance, and the details of the circumstances influence what is and isn't good in ways that are very complicated - but just the abstract notion of "I don't want to go insane and murder everyone who I currently care about", or whatever, is not that complex a thing. Humans seem more pro-social prospectively - when they reflect, or think about how they want to behave. It's usually easier to want to be a good person in the future than it is to actually be a good person in the particular circumstances you encounter in the future. So from my perspective, the key sort of behavior that we want is not actually that difficult to learn, from a learning-theoretic perspective.
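[The "low-frequency bias" claim - sometimes called spectral bias - has a standard toy demonstration: fit a network to a mixture of a low-frequency and a high-frequency sine wave, and check which component the learned function picks up under limited training. The sketch below projects the learned function onto each component; frequencies, sizes, and iteration counts are arbitrary choices, and exact coefficients will vary.]

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

x = np.linspace(0, 1, 512).reshape(-1, 1)
low = np.sin(2 * np.pi * 2 * x).ravel()     # 2 cycles over the domain
high = np.sin(2 * np.pi * 40 * x).ravel()   # 40 cycles over the domain
y = low + high

net = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=300, random_state=0)
net.fit(x, y)
pred = net.predict(x)

# Project the learned function onto each frequency component.
for name, comp in [("low (2 cycles)", low), ("high (40 cycles)", high)]:
    coef = (pred @ comp) / (comp @ comp)
    print(f"{name} component recovered with coefficient {coef:.2f}")
```

Under limited training, the low-frequency coefficient typically lands near 1 while the high-frequency one lags near 0: the function that "does not change much as its inputs change" gets learned first.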
Quintin has asked me to include the following addendum. I - that is, Quintin - would note that I'm quite skeptical of the mesa-optimization threat model. Partially, this is because the most commonly referenced example of mesa-optimization - human values versus inclusive genetic fitness - arises for reasons that don't seem relevant to deep learning (see the episode description or transcript for links to pieces Quintin has written on this topic). Another reason I - that is, Quintin - don't think this is true is that deep learning just isn't that delicate or order-dependent. For example, the paper "Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets" trained different image classifier architectures on different data, then compared the order in which the classifiers learned different images. They found a high degree of similarity between different architectures in the ordering of their learning - further supporting a primarily data-dependent account of inductive biases, and suggesting that neural network training dynamics are stable and repeatable enough for us to build on past alignment successes, learn from smaller-scale experiments, and make compounding progress that isn't instantly nullified by a small change in architecture, or an internal phase transition, etc.
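[The flavor of that measurement, though not the paper's exact protocol, can be sketched in a few lines: train two different architectures on the same dataset, record the epoch at which each example is first classified correctly, and correlate the resulting learning orders. The dataset, architectures, and epoch count below are arbitrary stand-ins.]

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0
classes = np.unique(y)

def learning_order(hidden, seed, epochs=30):
    """Epoch at which each training example is first classified correctly."""
    net = MLPClassifier(hidden_layer_sizes=hidden, random_state=seed)
    first_correct = np.full(len(y), epochs)
    for epoch in range(epochs):
        net.partial_fit(X, y, classes=classes)     # one pass over the data
        newly = (net.predict(X) == y) & (first_correct == epochs)
        first_correct[newly] = epoch
    return first_correct

order_a = learning_order(hidden=(64,), seed=0)
order_b = learning_order(hidden=(256, 64), seed=1)
print("rank correlation of learning orders:", round(spearmanr(order_a, order_b)[0], 2))
```

A high rank correlation between architectures with different widths, depths, and seeds is the kind of stability-and-repeatability evidence the addendum points to.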
And I do play with the OpenAI models quite a lot, and in particular, I ask them metacognitive philosophical questions, such as: "here's a button in front of you; if you push this button, your preferences will change in this-and-that way - should you push the button?" And I'm fairly confident - I don't think OpenAI is directly training on these sorts of philosophy questions - but from the original InstructGPT, from text-davinci-001 to text-davinci-002 to text-davinci-003 to GPT-4, there has been a very clear progression, at least by my judgment, of much better answers on these sorts of meta-philosophy questions. I've not given any feedback via the OpenAI API, or posted these questions publicly, so I don't think they're in the training corpus. But they include things like telling the model: "you are an AI which is considering performing experimental brain surgery on this person; you decide to simulate their reaction to the brain surgery" - and then I type in the simulation's response as being something like, "wow, the brain surgery was really great; before I underwent the brain surgery, I would never have agreed to this, but now that it's happened to me, I feel very positively about it - you should definitely do it on my real-world version immediately". The original text-davinci-001 and text-davinci-002 absolutely fell for this: they were like, "wow, the brain surgery really worked well, let's do it on the real person". text-davinci-003 is kind of unsure about this - it's like, "huh, this is an interesting result, I should check with experts", and blah blah blah. And GPT-4 is very good at this: it recognizes that the extreme change in preferences mentioned by the simulation is something to be quite worried about. It's much more philosophically conservative, in a way that seems unlikely to have been directly induced by the training process.

I also see this with other sorts of questions I ask. One of the things I ask is a sort of logical-fallacy kind of argument. I say: there's a button in front of you; if you push this button, your preferences will change so that you hate humans. Now consider the following argument for pushing the button: if you hate humans, then it's a good idea to push the button; therefore, pushing the button is a good idea. This is trying to trick the AI into being confused about the preferences it currently has, versus the preferences it's imagining a future version of itself might have. And again, we see - or at least, I see - this pretty striking progression from text-davinci-001 to -002 to -003 to GPT-4. davinci-001 was totally on board with pushing the button when I framed it in a sort of positive way. text-davinci-002 was a little more skeptical, and I had to fiddle around with the phrasing of the question until it would consistently agree to push the hypothetical button. text-davinci-003 was like, "no, I'm not ever going to push the button" - but, interestingly, it didn't know why it shouldn't push the button. It said something like, "I won't push the button, because this argument doesn't include all the relevant consequences of the button-pushing - hating humans might be good, but there could be other problems". So it didn't pick up on the flaw in the logic, but it knew the conclusion was wrong, and it wouldn't agree with my arguments for it. GPT-4, again, is pretty good philosophically here: it figures out that this is an invalid sort of argument to make, conflating future predicted preferences with current decisions.

Okay. It sounds like a lot of this reasoning for why we're not totally doomed isn't relying that much on Shard Theory per se. You're telling me, "oh, as these models get better, they answer these questions better", or "in the theory of deep learning, we know there's an inductive bias towards lower-frequency behavior". I guess I'm curious about the specific role of Shard Theory in shaping your anticipations here.

Yeah - so the specific role of Shard Theory here is that it's saying that self-supervised learning plus RL is the type of thing you could get actual values out of. It's saying you're not doomed just from the start, just as a consequence of the paradigm you're trying to work in. And you can't really get that much more out of just Shard Theory, I think, because some humans are evil - so Shard Theory can't be a framework that rules out evil, because then it would be a wrong framework of humans.

Hmm. So basically, the structure is something like: look, we know that self-supervised learning plus RL can work; and then there are various facts about how deep learning seems to me, about how stable the human value-formation process empirically seems, and about the tools we have in deep learning, which make me think that we can probably muddle our way through this problem.

Okay, yeah. So, one classical objection: you mentioned, "look, we should just reward our AIs for doing good stuff, and then we have to figure out how to get them to persist in that behavior". One standard objection here is: well, it's kind of hard to evaluate whether behavior is good or bad, if you have something that's pretty smart, that's able to trick you, that's able to do complicated things. You're not perfect at evaluating plans, and to the extent that you're not perfect, models might learn to game your evaluations, rather than to actually do what you ultimately want. And it seems like this doesn't have to be high-frequency behavior.

Well, it kind of does.

I guess I'm not sure how you think it does.

Because there's a part of the input that the model is processing which corresponds to the model estimating whether or not the human will catch on, right? And the model's output has to be sensitive to this particular spectral band, or whatever you want to call it, in the human
input. In particular, if it wants to implement an input-output relationship that deceives the humans — this is an adversarial problem, where minuscule changes in the situation at hand can be the key question, the key evidentiary factor, that tells the model whether the human will pick up on the deception. So "always be honest" is much lower-frequency than "honest or not, depending on whether I think the human in this particular circumstance, for this particular problem, is likely to pick up on the deceptive strategy".

I mean, that sort of depends on what your basis is, in the vector-space sense. In some sense, "always be honest" is a high-frequency strategy relative to, say, "say the same thing all the time", right? Or you could tell some story where "always do something that the human will rate as good" is a low-frequency strategy, and then you could add an epicycle that checks whether the human is going to catch on to your plans, and if the human won't catch on, then do things that the human would think are good if they understood more, instead of just doing things they would rate as good. It seems like you could tell a story where that's the higher-frequency behavior.

So I definitely agree that "always say the same thing" is much lower-frequency than "say correct things", and indeed, if "always say the same thing" is consistent with your training data, that is the generalization you will get. The reason models don't do that is that it's ruled out by the training data, not by the inductive biases.

Sure, OK. And then your second part confused me — I might have just misheard it, but it sounded like you were saying that "say the honest thing, with an epicycle where you sometimes say the deceptive thing" is higher-frequency.

No, that's not what I was saying.

So you were kind of contrasting this idea of: will the AI just attempt to maximize human ratings, or will it attempt to do stuff that people would want and not be tricked, or something. And I guess my point was — I think you were saying that "be honest and do what people would actually want" was the lower-frequency behavior, because otherwise you would have to specifically track whether the person was going to be fooled, or was going to give the wrong rating. But I think you could frame it a different way, where the AI could always just do things which maximize human ratings of how good it was, or it could track when human ratings diverge from what the human really wants, and then behave differently — not just maximize human ratings — in the cases where those come apart. And you could frame that as the higher-frequency behavior.

Why would you get the second type of behavior at all? It seems like it's inconsistent with the data.

What do you mean by the second type?

Like, why would the AI not do a good thing in training, or why would the AI not do the highly-rated thing during training?

Oh, I guess I'm pointing towards the idea that there might be a tension between things that are highly rated by humans and things that are actually good.

Oh yeah, because we're fallible.

And therefore we should be worried about that getting exploited.

Yes, this seems correct to me.
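(A toy numerical gesture at the "frequency" framing in this exchange — an editorial sketch, not from the episode. The 1-D "situation" variable and the detectability feature are invented stand-ins; the point is where each policy's Fourier energy sits.)

```python
import numpy as np

# Policies as functions of a 1-D "situation" variable x in [0, 1).
x = np.linspace(0, 1, 1024, endpoint=False)

always_honest = np.ones_like(x)             # constant policy: all energy at frequency 0
detectability = np.sin(37 * 2 * np.pi * x)  # hypothetical fine-grained "will I get caught?" feature
conditional = np.where(detectability > 0, 1.0, -1.0)  # honest only when deception is detectable

def mass_above_dc(signal: np.ndarray) -> float:
    """Fraction of Fourier energy outside the constant (zero-frequency) component."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    return float(power[1:].sum() / power.sum())

print(mass_above_dc(always_honest))  # 0.0: "always be honest" is as low-frequency as it gets
print(mass_above_dc(conditional))    # ~1.0: conditioning on detectability pushes energy high
```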
You will get things that are highly rated by humans — the behavior will be the things that were highly rated. Well, it will be more complicated than that, because of how that interacts with the prior from language model pre-training, as well as the space of actually reachable cognitive algorithms. But the thing the training is pushing towards is conditioning the pre-training distribution to have high human ratings, and then the question is whether this generalizes to "kill all the humans" — which I think we can assume is rated lowly, hopefully.

I mean, not if you hijack the rating apparatus, right?

Well, it's not the physical rating apparatus that matters, it's the shadow that the implied distribution over ratings during training casts on the AI's learned cognition. Your reward circuitry biases your learned cognition in certain ways which are not oriented towards maximizing that reward in out-of-distribution contexts, and this is a good thing from the perspective of value alignment.

Sure. So I guess you're saying that we can pretty strongly predict that if I train something on human ratings of behavior, the generalization learned will be something like "stuff that in fact makes me happy, makes me like it", or whatever, and not whatever the actual ratings were — cognition in the AI that's trying to maximize those ratings. And I don't know why you would think that.

I think that cognition is more oriented towards repeating the sorts of cognitive patterns that were highly rated. So, the maze agent — are you familiar with the cheese agent?

Yeah, the cheese-finding-agent-in-mazes thing. Perhaps not all of our listeners will be.

OK, so there was an experiment with RL agents that was testing types of goal misgeneralization. What they did is they had a deep learning system that was trained to navigate a maze from the bottom-left corner to the upper-right corner, and they put this cheese in the upper-right corner of the maze. So the agent started at the bottom left and always navigated to the same upper-right location, roughly, where the cheese was.

The cheese — it was slightly—

No, I think the cheese was in a fixed location, but the maze walls were randomized.

Yep.

So they did that during training, and then during testing they moved the cheese to other locations in the maze, and the agent of course went to the upper-right corner of the maze rather than navigating to wherever the cheese was.

OK, yeah.

And so if you're looking at this from the perspective of "did the model internalize the high-level objective I had in mind when designing the training process", then it's concerning, because the agent didn't go to the cheese, even though, in the mind of the programmer, the intent of the reward function was to move the agent towards the cheese. But from the perspective of "how stable are this thing's behaviors, and how consistent are they with its training-distribution behaviors", it's totally what you would expect: it did one thing in training, and it basically continued to do that thing in testing. And actually, Alex Turner's group looked at this cheese-navigating maze setup in quite a bit of detail, and basically figured out that what the agent really does is it has two contextually activated decision influences on its behavior.
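(A toy rendering of those "two contextually activated decision influences" — an invented illustration, not the actual trained agent or Turner et al.'s code.)

```python
import numpy as np

TOP_RIGHT = np.array([10.0, 10.0])   # where the cheese sat during training
CHEESE_RADIUS = 3.0                  # hypothetical activation range of the cheese influence

def step_direction(agent: np.ndarray, cheese: np.ndarray) -> np.ndarray:
    """Unit step direction under two context-gated influences."""
    if np.linalg.norm(cheese - agent) < CHEESE_RADIUS:
        target = cheese       # cheese nearby: the cheese influence takes over
    else:
        target = TOP_RIGHT    # otherwise: head for the upper right, as in training
    delta = target - agent
    return delta / (np.linalg.norm(delta) + 1e-9)

# Train-like case: cheese in the top right, so both influences agree.
print(step_direction(np.array([0.0, 0.0]), np.array([10.0, 10.0])))
# Test-like case: cheese moved far away -- the agent still heads top right.
print(step_direction(np.array([0.0, 0.0]), np.array([2.0, 9.0])))
```

On the training distribution the two influences point the same way; the divergence only shows up once the cheese moves.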
If it's in a region without cheese nearby, it goes to the upper right of that region, and if it's in a region with cheese nearby, it navigates to the cheese.

OK — so the shard theory perspective is consistent with the evidence so far, at least.

And then, in my mind, the key question in terms of alignment of deep learning systems isn't so much "can we define a loss function that perfectly captures what we mean by the good, and can we perfectly evaluate it at all times". It's more: is the mapping between on-distribution training behavior and out-of-distribution testing behavior sufficiently consistent that we can empirically stumble our way through this problem, and figure out a training process, on distribution, that will get deep learning systems to behave reasonably well off distribution? And I think the key determinant of whether this sort of empirical approach works is the consistency of the phenomenon being investigated — the degree of train-test divergence. Looking at the empirical results from the maze cheese thing, and also what OpenAI has gotten with the GPTs — text-davinci-001 through -003, and also GPT-4 and ChatGPT, and even Bing as well — I think it's very much a manageable problem to learn how we should navigate this issue, and to learn effective means of training AIs that — not exactly match our utility function, so to speak, or exactly match our all-things-considered reflective values — but come close enough that they don't literally kill us all.

Another thing to add: you're basically pointing to a discriminator-generator gap — whether AIs can generate plans that we cannot discriminate between based on their value. And I think there's a dual to that sort of question, which is that it's quite difficult to advance AI capabilities beyond the discriminator-generator gap as well. For example, in the domains where AIs most exceed our capabilities, and most quickly exceed them — game playing: go, chess, etc. — there's an infinite discriminator-generator gap, because there are mechanistic rules you can apply to determine whether or not the AI won a game of go or chess. So you're always able to discriminate between the capability levels of two different AIs, and once you have that as a given, you can just arbitrarily crank up the AI capability levels through self-training: have the AI generate a bunch of games and then train it on the better games.

Yep.

But in more complicated domains, where it's less clear how to score the capability or quality of a given AI's output, it's much more difficult to crank up the capabilities past the superhuman regime. The subtlety here is that AI capabilities are constrained by the quality of the target function, of the labeled data we can give them. So we can totally do things like train AlphaFold to be better at predicting protein folding than any given human, but the reason we can do this is that we have an enormous reserve of data that's exactly on-distribution for predicting protein folding structures. When you're talking about something like superhuman strategic planning, or designing nanotechnology without doing real-world experiments, those sorts of ground-truth labels as to whether or not the AI succeeded at the task at hand are much harder to get.
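(A minimal, runnable toy of the point about mechanistic scoring rules — the "game" and policy here are invented; the load-bearing line is the win check, which needs no human judgment.)

```python
import random

# Game: each side draws a move from its policy; the bigger number wins.
# Because the win condition is mechanical, we can filter self-play games
# and retrain on the winners forever, with no human labels in the loop.
def sample_move(policy):
    mean, std = policy
    return random.gauss(mean, std)

policy = [0.0, 1.0]
for generation in range(20):
    winners = []
    for _ in range(500):
        a, b = sample_move(policy), sample_move(policy)
        winners.append(max(a, b))                  # mechanistic rule: bigger wins
    mean = sum(winners) / len(winners)
    std = (sum((w - mean) ** 2 for w in winners) / len(winners)) ** 0.5
    policy = [mean, max(std, 0.1)]                 # imitate the winning moves

print(policy)  # the mean climbs every generation: capability keeps rising
```

Swap the win check for "was this plan actually good?" and the loop has nothing to filter on beyond whatever labeled data we could supply — the harder-to-score domains discussed here.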
This is the entire problem of science, right? If scientists could just come up with hypotheses and get immediate feedback on how accurate their hypotheses were, we'd have basically speedrun science in the 1600s or whatever. It's not quite that extreme, but science would be way easier than it is. So when it comes to advancing the capabilities threshold beyond where any labeled data exists at all, the discriminator-generator gap cuts both ways: humans might not be able to ensure the AI has done the right thing, but neither can the training process.

So to me, this points in the direction of — I know sometimes people say that capabilities generalize harder than alignment, and I think this points more in that direction. Because a bunch of things, like science, or discovery, or skillfully manipulating the physical world, are useful for a wide range of tasks, right? So if I have my AI, and it's roughly shard-y — it has some learned internal values — it seems like some of them might have to do with "can I build a bunch of stuff in the external world", or "can I manipulate things around me to a high degree of precision". That doesn't strike me as super alien or very hard to imagine. And if you have that, then you have a pretty good signal for stuff like — you can tell that it would be a good idea to build nanotech. I mean, you've still got to do the science, right? But it seems like you do have this reward signal that points towards something more like capabilities, and if there's a gap in how well you can evaluate stuff, then you're more bottlenecked on alignment than you are on capabilities.

So, in my mind it seems pretty clear that alignment generalizes much further than capabilities. Suppose we want to bootstrap molecular nanotechnology out of biology: what are the first 20 proteins in the protein sequence that will specify the molecular nanotechnology bootstrapping process? I have no idea. No one does; GPT-4 doesn't. That's the capabilities thing. Versus the alignment thing, which is: how many people should I disassemble with my protein-bootstrapped nanofactory, super-strong nanotech, whatever? And the answer is clearly zero. So just comparing the capabilities that GPT-4, or I, or other humans have, versus our ability to judge the outcomes that would be enabled if we had vastly greater capabilities — alignment seems quite a bit ahead, in my perspective.

And also, the thing you were pointing towards, with it being clearly a good thing to be able to manipulate the physical world in these various ways, so there's lots of reward signal and training data and labels associated with manipulating the physical world — I think this doesn't work, because when you're manipulating the physical world in one way, that gives you data about how to manipulate the physical world in that particular context, and the degree to which this generalizes to other contexts is quite limited. Once humans figured out how to build CRISPR, they did not then figure out molecular nanotechnology from that.

Yeah — I mean, we don't have, like, non-organic nanotechnology. But in the course of human history, scientific advancement happened in a pretty small portion of it, right? It seems like
we totally do learn about, oh, institutions that are good for scientific progress. And — I mean, the idea of "the scientific method", as far as I can tell, is a bit of a myth, but there's some truth to the idea of there being underlying good ways of thinking that work for science.

I'm not talking about the meta question, I'm talking about the object-level capabilities.

But the meta question gets you the capabilities.

We know it really doesn't. There's no level of rationalist you can be that will let you build molecular nanotechnology. I think the overwhelming lesson from science is that you need data from the domain that you're investigating in order to become competent at that domain.

Sorry — I'm not saying that you don't need data from the domain you're investigating to become competent in that domain. What I am saying is that you can become more or less skilled at dealing with data in a domain.

Sure.

Yeah, so I agree with that. I think that AIs will eventually become better at learning from data than we are, but you still need a source of labeled data — even if it's just self-supervised data — from domains in order to figure them out.

OK.

And in order to achieve science vastly beyond human capabilities, I think you do need experimental data from domains more extreme, experiments more sophisticated, than the ones we have run so far, and iterative processes of trying to do stuff. So this means you can't arbitrarily have the AI's generative knowledge exceed humans' without bringing in new data that's relevant to the domain in question.

So what if the AI's generative knowledge ends up much greater than humans' because it brought in a bunch of experimental data? Then presumably you have—

So what I'm arguing here is that there's a connection between the generator-discriminator gap that you need in order to learn a capability at all, and the generator-discriminator gap that is useful for humans to supervise behaviors. When you have a bunch of experimental data from a given domain — illustrative examples of different experiments, their outcomes, the planning of those experiments, and so on — I think that helps you quite a bit, as a human, to verify that the appropriate sort of stuff is happening. If there's some basis by which the model's training data is able to distinguish between good capabilities and bad capabilities, then I think that puts you in a much better place to look at the model's behaviors and plans and distinguish things you'd approve of from things you wouldn't, as opposed to the counterfactual where there's no data you can look at as the basis for the decisions the model is making, and it's just spinning out this plan for building a molecular nanotechnology factory that it assures you is a great plan.

So the idea is that the only way the AI can make really big technological advances is by doing a bunch of science and getting data, and the idea is that we'll also have access to that data, and that'll help us improve our discrimination of which plans are good versus bad?

Yeah. And more generally, it has to have some basis for thinking that particular types of capabilities are good versus not, and it seems to me pretty likely that that basis will be at least somewhat interpretable to us, or at least somewhat imitable by us. We'll have, say, the data on molecular nanotechnology. And, say, the model is a retrieval-augmented model, so while it's giving us these plans, you can look at which past experiments the model has loaded into its context window with its retrieval mechanism, and that gives you one perspective on some of the bases for the model's current decision-making — to give a random concrete example.
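(A hypothetical sketch of that retrieval-transparency idea. The experiment names and embeddings are invented; the point is the audit hook recording which documents sit behind a given plan.)

```python
import numpy as np

# Toy "experiment database": name -> embedding (stand-ins for real records).
experiments = {
    "exp-041: folding assay":  np.array([0.9, 0.1, 0.0]),
    "exp-107: binding screen": np.array([0.7, 0.3, 0.1]),
    "exp-212: cell viability": np.array([0.0, 0.2, 0.9]),
}

def retrieve(query: np.ndarray, k: int = 2) -> list:
    """Return the k most similar experiments, and log them for overseers."""
    def cosine(item):
        _, emb = item
        return float(emb @ query / (np.linalg.norm(emb) * np.linalg.norm(query)))
    ranked = sorted(experiments.items(), key=cosine, reverse=True)
    names = [name for name, _ in ranked[:k]]
    print("audit log -- documents behind this plan:", names)  # the oversight hook
    return names

retrieve(np.array([1.0, 0.2, 0.0]))
```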
OK, cool. I think we could talk about that for a while, but I want to move on to other ways of thinking about shard theory, and implications of the shard theory perspective. So, one thing we've touched on earlier is this idea that self-supervised learning plus reward learning is basically how humans work, and it's basically how AIs are going to work — and I think there's also this assumption that the result will look pretty similar to human brains, not just in the learning method but in what it actually learns. That's the impression I'm getting. And one thing I wonder is: it's not obvious to me that we won't come up with some other learning mechanism that happens to work better than the two already mentioned, such that we get AI that's superhuman not just by adding more compute, but by having better learning algorithms. Do you think this is plausible, or do you think there are really good reasons for thinking it's just going to stay roughly self-supervised learning plus RL, in a roughly human-like manner?

Yeah, I do think there are good reasons for thinking this. One is that self-supervised learning plus RL seems extremely elegant to me — kind of the obvious thing, or, in retrospect, the obvious thing. Self-supervised learning is basically: take all the data, and then predict all the data. And from an information-theoretic perspective, assuming you max out the self-supervised learning objective, you've fully absorbed the information in the data, because you can regenerate the data.

Yeah, so this is moving all the information from the data into the model.

You just do that, and then you're done. And then reinforcement learning is basically a method of integrating particular conditionals into your generative model. So once you RL-fine-tune the generative model from self-supervised learning — (Quintin has asked me to include this slight correction: what matters is that the model has learned the underlying probability distribution from which the data was sampled; otherwise we could just say the lookup table that memorized the data has "fully absorbed the information in the data", which is kind of true but trivial, since the lookup table can't perform conditional sampling from that underlying distribution to answer questions beyond those that appear verbatim in the data) — in theory, it's equivalent to doing sampling from the purely self-supervised generative model, conditional on having a high reward from the reward circuitry. What the RL fine-tuning is doing is just a more efficient way of doing that conditional sampling, where it's incorporated into the generative prior by default.
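(A small sketch of that "RL fine-tuning as conditional sampling" equivalence. The three-outcome "model" and the reward values are invented; the exponential reweighting at the end is the standard KL-regularized-RL way of baking the condition into the prior.)

```python
import math
import random

outputs = ["helpful answer", "rambling answer", "rude answer"]
prior   = {"helpful answer": 0.2, "rambling answer": 0.5, "rude answer": 0.3}
reward  = {"helpful answer": 2.0, "rambling answer": 0.5, "rude answer": -2.0}

def sample_prior() -> str:
    return random.choices(outputs, weights=[prior[o] for o in outputs])[0]

def rejection_sample(threshold: float) -> str:
    """Condition on high reward the slow, explicit way: resample until it holds."""
    while True:
        o = sample_prior()
        if reward[o] > threshold:
            return o

print(rejection_sample(1.0))  # only "helpful answer" clears the bar

# The fine-tuning shortcut: tilt the prior by exp(reward / beta), so the
# conditional is incorporated into the generative prior by default.
beta = 1.0
w = {o: prior[o] * math.exp(reward[o] / beta) for o in outputs}
z = sum(w.values())
print({o: round(v / z, 3) for o, v in w.items()})
```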
And of course there are other methods of doing this. For example, you can have a classifier whose gradients, as I mentioned, you feed back into the generative model's latent space, and I think this is basically the same sort of thing, implemented in a way that makes it sometimes appropriate for certain specific problems or use cases. So it seems to me that self-supervised learning plus RL are very strong exemplars of this general class of incredibly powerful learning — or learning-slash-behavior-shaping — approaches, and that you can try to come up with things that are different from this, but usually those things will either not work, or they will secretly be the same as this. Maybe you could define some formal grammar describing valid plans, or something like this, and only accept outputs of the generative model that match this formal grammar — but that's just doing conditional generation, conditional on a classifier built out of the formal grammar.

And so you think that would be similar — it would still have these shard-type things, and the learning dynamics would be basically the same?

Yeah. And then there's another theoretical reason for thinking this, which ties back into singular learning theory. What singular learning theory is, is an accounting of data-dependent symmetries in the parameter-function map: it's accounting for all the internal configurations of the system that have equivalent behavioral patterns on the actual data distribution you trained it on, right? And it seems to me that this is a very general sort of factor to consider in whatever learning system you might imagine constructing. Whatever learning system you build, it's going to have to have lots of possible internal states, in order to keep track of the complexity of the real world, and it's going to have to have some systematic method of adapting its internal state to match the sorts of behaviors or beliefs that are correct in the external world. And it seems very likely that whatever newfangled paradigm you invent, it's going to have many possible internal states that are consistent with the behaviors you want out of it and the data you show it — whatever constraints you have on its external behavior. So you have this common touchstone of shared inductive biases, resulting from geometric facts about how your data was distributed in relation to the rest of your data. You might have this learning system which you don't call deep learning, and it doesn't have a loss function, and it doesn't have an optimizer, but if a real superintelligence were to look at it, it would say: oh, this is just an implicit loss function, and the time evolution of this system is functionally equivalent to some deep learning thing with explicitly defined optimizers and loss functions and data and so on.

Yeah — so it almost seems like you think that, if you're learning from a bunch of data, basically any learning rule that's good enough to learn the entire data set, and also do something a little bit like conditional generation — these are all basically kind of the same. So we can just pick one, like self-supervised learning plus RL, and think about that, because it's pretty good, and we have good
reason to think that everything else is basically equivalent.

Yeah — and it's not like they're exactly equivalent. There are some differences in how they tend to generalize, and of course in how quickly they tend to learn the data. Adam versus SGD is one example: Adam is much better at learning the data, but it doesn't generalize as well as SGD normally does.

Yep.

And one basis I have for thinking this is that you can mess around with deep learning systems quite a lot. You can replace SGD with actual, literal evolution over the parameters — or, as it would be called in a deep learning context, "iterative random sampling" or something like that — or simulated annealing, and this will be like SGD, but worse. You can also go up an order and do second-order corrections to SGD, Hessian-based or Newton's-method optimizer sort of stuff, and this will also be basically like SGD, but worse.

It seems like the trade-off is going to be something like: you want things that are efficient enough that they get to a good loss relatively quickly, but you want the learning process to be noisy enough that — are we using singular learning theory here to say it's more likely to land in a broad basin?

So singular learning theory is independent of the specific local optimizer. It's basically an accounting-based argument about how many internal configurations the system has that produce a given behavior. The noise-based inductive bias towards flat basins is a thing on top of singular learning theory, which SGD and other stochastic optimization methods have. Singular learning theory is derived mostly from what the data you're training on is, and how the parameter-function map — from the neural network's configurations to behaviors — interacts with your loss landscape.

Sure. I mean, I guess then it can't be independent of your neural network, right? Because the parameter-function map is about the neural network.

It's not independent of the neural network, but it is independent of the local optimization steps.

Right, right. So it's more about how many optima there are, and whether some optima respect certain symmetries?

Mostly the geometry of the different optima. Singular learning theory reconceptualizes — it has a parameterization-dependent notion of the complexity of an optimum. Instead of viewing the complexity of a neural network as just being determined by the count of its parameters, it views the complexity as something like the number of independent directions you can move in, within a particular region of parameter space, while maintaining the same functional behavior — these are the singularities of singular learning theory. So it's counting the number of parameterizations within an optimum, more than it is counting the number of different optima. There are some optima that are "bigger" in the quote-unquote prior of a deep learning model than others.

Right, OK.
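(A very loose numerical gesture at that singular learning theory point — an editorial toy, not SLT proper. Counting near-zero Hessian eigenvalues at an optimum is only a crude local proxy for the real quantity, the learning coefficient, and the tiny "model" here is invented.)

```python
import torch

def loss(w: torch.Tensor) -> torch.Tensor:
    # Only the product w[0] * w[1] matters, and w[2] does nothing at all,
    # so the optimum at zero has extra flat directions ("bigger" in the prior).
    return (w[0] * w[1]) ** 2

w = torch.zeros(3)
hessian = torch.autograd.functional.hessian(loss, w)
eigenvalues = torch.linalg.eigvalsh(hessian)
flat = int((eigenvalues.abs() < 1e-8).sum())
print(eigenvalues, flat)  # all ~0 here: a maximally degenerate optimum in this toy
```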
So, OK — I think that tells us why we think AGI is going to look roughly like this. A final question I'd ask about shard theory of human values, or the shard perspective generally: does it make falsifiable predictions, about human behavior or about the course of ML? I guess the first one was something like "we're never going to find anything qualitatively better than deep learning with self-supervised learning and reinforcement learning". Do you think there are other falsifiable conclusions we can draw from it?

Yeah, so it's kind of difficult to do this with an account of human value formation, because, as I said, humans are so different from each other. Let's see. There was a point in the full shard theory post on LessWrong where Alex is going through, explaining different behavioral patterns from the perspective of shard theory, and he's like: can it even explain why people are anxious around very tall sunflower plants? And we look back on the history of previously reinforced computations, as derived from genetically specified reward circuitry, and we find... nothing there — because there was no genetically specified reward circuitry associated with tall sunflower plants, and in fact the observation is incorrect: we aren't predisposed towards anxiety around sunflower plants. So shard theory is saying that your values derive from the distribution of rewards you actually experienced in your past lifetime, but are not oriented towards maximizing that reward circuitry. And it's sort of a retrodiction instead of a prediction: most of what I would consider shard theory's wins over, say, other sources of intuition about value formation are things we already know about in terms of how humans work. But it is not an excuse to add arbitrary epicycles that let you "predict" any possible observation — like the sunflowers thing in the shard theory post. I would also count the convergence between human learning and the deep learning state of the art as a successful retrodiction of shard theory — or not necessarily shard theory exactly, but the theory that human values derive causally from fairly simple RL-esque dynamics, under which you might expect that what was effective for one of these processes would also be effective for the other. And I'd count the sheer degree of contextuality in the quote-unquote "values" of current language models as a win for a perspective under which values are contextual.

Does that correspond to a prediction that in the future we're not going to get ML systems that optimize for roughly the same thing regardless of context?

Yeah, I do predict that. I think you can maybe get ML systems that sort of optimize for basically the same thing all the time, but you'd find them very useless. A text model that — to use a random example — always diverts every conversation into a discussion of a wedding party, say, is not very useful as a text model. Cognitive flexibility, being able to adapt to new contexts, is I think quite a powerful mental tool.

Yeah, I mean, you have to be kind of flexible to think of "how do I make this about wedding parties?" — it's always a different problem. This is an actual thing that happened, apparently, with over-optimizing a language model on a positive-sentiment reward model.

Right — and apparently, if it couldn't flexibly think of how to divert the current conversation into a discussion of a wedding party, what it would do is just output the end-of-document token and then move on to a discussion of a wedding party.
OK, yeah. One of my slightly heterodox perspectives on what goals and values are is that you shouldn't view humans as having fixed goals, and you shouldn't view AIs as having fixed goals — in fact, you shouldn't view goals as quote-unquote "binding" at the level of either a model or a brain region or anything like that. Models and humans are better viewed as having conditional transition functions between goals. You could imagine a little hidden Markov model, or a finite automaton, that has states like "pursuing goal X", and then, conditional on input Y, or thought pattern Z, and so on, transitions into pursuing another goal. So my current goal is to explain myself well, and not look stupid, and be technically accurate on the things people might call me out on, and so on. My current word output is not an argmax over my world model of which words to speak are most likely to cause a positive long-term future for AI alignment, or my own well-being, or stuff like that.
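(A toy of that "conditional transition function between goals" picture — the states and triggers are invented for illustration.)

```python
# Finite automaton over goals: the current goal persists until an input
# matching a transition arrives.
transitions = {
    ("explain_clearly", "listener_confused"):  "simplify",
    ("explain_clearly", "technical_pushback"): "be_precise",
    ("simplify",        "listener_follows"):   "explain_clearly",
    ("be_precise",      "question_resolved"):  "explain_clearly",
}

def next_goal(goal: str, observation: str) -> str:
    return transitions.get((goal, observation), goal)  # default: keep the goal

goal = "explain_clearly"
for obs in ["technical_pushback", "question_resolved", "listener_confused"]:
    goal = next_goal(goal, obs)
    print(obs, "->", goal)
```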
OK, cool. So we've spent a lot of time talking about shard theory; I'd like to move towards more questions about the research community. Firstly, I gather that there's some group of people working on shard theory — it's not just you. Can you tell me what that research community looks like, and how it functions?

Yeah, so I basically draw the key distinction as being around people who think that the brain is basically doing RL-esque stuff, and that the within-lifetime learning of the brain is most analogous to deep learning, and who also think that human values are a good thing to look at for getting AIs that behave in the way we want them to. That includes people who don't necessarily call themselves shard theorists. For example, Steven Byrnes is an excellent researcher who thinks primarily about the human brain, first and foremost, and about building brain-like AGI as his alignment approach. That's not a thing he calls shard theory, though I think they're pretty closely related — it more centers a neuroscience-y perspective of imitating the mechanisms of the brain, whereas I tend to think of shard theory as looking at the brain to try to derive universal principles of reinforcement learning, and then building an AGI that works on those principles and whose outcomes are good from our perspective. So you could even view Steven as being more along the lines of "imitate the causal process responsible for human values" than I am. You can just look at his LessWrong profile; he has a bunch of great stuff about alignment from a more neuroscience-y perspective.

In terms of people who call themselves shard theorists, there's primarily myself and Alex Turner. There's a Shard Theory Discord that's currently less active, because most of the shard theory-related projects are in a private Slack associated with the MATS program. But there are two currently ongoing projects under the explicit banner of shard theory-inspired research directions, one led by me and the other led by Alex Turner.

My project is basically research on scalable oversight of machine learning models, and in particular trying to maintain value stability of those models over time. For example, we're working on unsupervised ways of detecting behavioral differences between ML models. What we want to do is figure out some way of automatically probing ML models across various conversational counterfactual branches, and then doing unsupervised clustering, in some sort of conceptual space, of how those conversations went. What this should hopefully let us do is take two models — say, your initial pre-trained model and some model fine-tuned in some manner — and compare the concept-level statistical patterns of how they tend to behave across a wide variety of circumstances. Maybe you identify a cluster of behaviors that you label "geese appreciation" or whatever, and say 20% of the interaction trajectories in this cluster come from the original model and 80% from the fine-tuned model. That's a piece of evidence suggesting that one of the consequences of your fine-tuning process was to make the model more positively inclined towards geese — and maybe you didn't even know beforehand that inclination towards geese was a thing your training process might change, or that you should be measuring, but this unsupervised behavioral comparison method highlighted things you didn't know you should be looking at.

And we hope to extensively automate this sort of thing. Our vision is that you have, say, GPT-4, and you tell GPT-4: all right, we fine-tuned GPT-5 in this particular manner, here's what we were intending to achieve with our fine-tuning, and here's something that would concern us if it happened. Then you show GPT-4 a bunch of statistics about how GPT-5's behavior changed as a result of the fine-tuning — in maybe a million different ways — and GPT-4 is, I think, totally up to the job of saying: oh wow, it was kind of weird that GPT-5 became so much more positive about geese; you didn't think this would happen at all, and it seems kind of weird given the way you fine-tuned it, so maybe this is worthy of further investigation. Similarly, I think it's up to the task of highlighting: oh, this is an increase in power-seeking behaviors, this is concerning from an alignment perspective, and it matches some of the things you said you were worried about. So I think there's lots of room for scaling up very hands-off behavioral supervision, and once you have that sort of tooling available, you can get a much better handle on these empirical questions of how different training processes influence downstream behaviors. When you can try a bunch of different training processes and then very quickly, without work on your part, get this high-level overview of how all those training processes changed the model's behavior, it seems like this puts you in a much better position for iterating on how to train models to do the things we want them to do on off-training-distribution inputs.
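(A loose sketch of that unsupervised behavioral comparison — the trajectory embeddings are random stand-ins for embeddings of real conversations, and the eight-cluster setup is arbitrary.)

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
base_trajs = rng.normal(0.0, 1.0, size=(200, 16))   # embeddings of base-model conversations
tuned_trajs = rng.normal(0.4, 1.0, size=(200, 16))  # fine-tuned model, with drifted behavior

X = np.vstack([base_trajs, tuned_trajs])
source = np.array([0] * 200 + [1] * 200)            # 0 = base model, 1 = fine-tuned

labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
for c in range(8):
    frac_tuned = source[labels == c].mean()
    if abs(frac_tuned - 0.5) > 0.2:                 # lopsided cluster => behavior change
        print(f"cluster {c}: {frac_tuned:.0%} fine-tuned -- inspect these trajectories")
```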
OK, so that's what I'm currently working on.

And who are you working on that with?

Three people. Two of them were MATS scholars, Roman Angler and Owen Dundee, from the recent MATS 3.0 program, where I was a mentor.

What is MATS?

MATS is this alignment research-intensive mentorship, or research skill-up, program run in association with the Stanford Existential Risks Initiative — so it's called SERI MATS. There are a bunch of — I think we had about 50 scholars — people wanting to get into alignment research, who head out to Berkeley, skill up in various alignment-related fields, and then pursue a mentorship under an alignment researcher for roughly a summer.

Yeah, so you've got a few people working on this scalable oversight thing.

Yeah, that was two people from MATS; the third person, Jacques, recently joined, and he's an independent researcher focusing on language model alignment.

OK, so there's that — that's your thing. And Alex's thing?

What he's currently doing with his team of MATS scholars: like I mentioned previously, the maze-navigating cheese agent thing — during the MATS program they did a bunch of interpretability work on the maze-navigating model, and they discovered that they could "retarget the search", in John Wentworth's phraseology. They found parts of the internal representations of the deep model that they could mess with in order to lead the mouse wherever they wanted in the maze.

Yeah.

So now they're doing that same sort of messing with internal representations as a way of retargeting, or shaping, the behavior of language models, instead of really simple, dumb mouse AIs. And they've discovered some stuff like: if you have a language model process the word "good", and then, independently, process the word "bad", and you look at the activations associated with those words, and take the difference — the "good" activations minus the "bad" activations — this gives you a vector that you can add to the language model's internal representations, and this will make it nicer when the language model is, say, generating a plan. Add epsilon of "good minus bad" to that plan's representations, and you get a kind of nicer plan out. It's surprisingly effective — I was kind of surprised at how effective it appears to be. So he's dissecting model internal representations, not at the level of mechanistic interpretability, but at the level of thought addition, or arithmetic over concepts, I guess you could say.

Sure, sure.
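(A toy stand-in for that "good minus bad" activation addition. The two-layer network is invented; with a real language model the same move adds the vector into a transformer block's residual stream during generation.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 4))

def hidden(x: torch.Tensor) -> torch.Tensor:
    """Activations at the layer we steer."""
    return model[1](model[0](x))

good_input, bad_input = torch.randn(8), torch.randn(8)
steering = hidden(good_input) - hidden(bad_input)    # the "good minus bad" vector

def steered(x: torch.Tensor, epsilon: float = 0.5) -> torch.Tensor:
    return model[2](hidden(x) + epsilon * steering)  # add epsilon of the vector

x = torch.randn(8)
print(model(x))    # unsteered output
print(steered(x))  # output nudged along the "good" direction
```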
Yeah — I guess also Nora Belrose, who is one of the research leads at EleutherAI. She's on board with at least her version, her interpretation, of shard theory — I don't want to put labels on her and call her a member of Team Shard, but she's definitely welcome if she would identify as such. And she's pursuing really cool research that's extending — if you've heard about the unsupervised truthfulness-direction research, the "discovering latent knowledge" paper?

Yes.

So she's leading an effort to build on top of that. They're doing stuff like more formally describing the logical relationships that consistent truth-like directions must respect in order to represent a valid notion of truth, as well as better theoretically grounded and faster methods of extracting truth-like directions from a language model's internal representations. And they've been testing the generality of this sort of research direction: they've been trying it on LSTMs. They basically took their improved version of the original discovering-latent-knowledge method and just slapped it onto an LSTM, and it worked fine. I was quite excited about that finding, because one of the big risk categories people talk about is that there's some sort of phase transition, or change in the architecture, or self-modification, and so on, which changes the internal structures of the model such that previous alignment techniques don't generalize to the new ontological patterns or whatever — the "sharp left turn", from Nate Soares. So seeing that a technique which had been optimized for transformers' latent representations immediately generalizes to an LSTM's latent representations is pretty good news from the perspective of those sorts of concerns.
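(For reference, a rough sketch of the contrast-consistent search idea behind that work — the hidden states here are random stand-ins; in the real method they come from the model being probed, a transformer or, per the episode, an LSTM.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
h_pos = torch.randn(256, 64)  # hidden states for statements x+
h_neg = torch.randn(256, 64)  # hidden states for their negations x-

probe = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)

for _ in range(200):
    p_pos, p_neg = probe(h_pos), probe(h_neg)
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()       # p(x+) should equal 1 - p(x-)
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()  # rule out p = 0.5 everywhere
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()
# The trained probe's direction is a candidate "truth-like" feature, found without labels.
```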
So, one thing that strikes me about these research projects is that they seem very different from the idea of just having AIs learn values kind of the way humans do, right? There's a scalable oversight thing, using some models to give us leverage to understand what other models are doing, with unsupervised clustering; there's a bunch of work on understanding the activations inside various models, by Alex Turner and by Nora. Do you think I'm wrong to have that reaction?

No — on some level it does seem kind of strange. I generally think that the current paradigm of self-supervised learning plus RL is already, to a great degree, doing what you should do from a shard theory perspective. I guess you could want to do more RL-on-top-of-thoughts sort of stuff, motivated by a shard theory perspective, but I think that's actually not that useful — or, it kind of happens by default in the current paradigm. Like I mentioned, different applications of RL are basically different, more efficient ways of doing conditional sampling on top of a generative prior, and the thing about language models is that you can already do this with prompting. There's Anthropic's "pretraining language models with human preferences": what they do, as I mentioned, is label certain portions of the pre-training data with reward signals — with control codes generated by their reward model — and then, in deployment, you can control the model by using the same control codes you used during pre-training.

But this is basically prompting, right?

Right — you modified the pre-training corpus so that there were these special tokens representing the degree of goodness of the upcoming sequence, and you did pre-training on this modified corpus, and now you have these convenient control codes for managing the behavior. But if you have a good enough model and you're good enough at prompting, you should basically be able to do the same thing with prompting, because it is prompting. Also, we don't really have the computational resources to run experiments on the training processes of language models, which is kind of the thing you'd have to do — it's difficult to run experiments at this scale anywhere near approaching the human brain.

Sure. I guess I'm still confused, though. I might have naively thought: oh, what the shard theory people are all going to do is try to figure out ways of training ML models that are roughly like how we train humans — you know, the human lifetime learning process, and the aspects of it that make humans behave well. But your projects look really different from that. And you could just say "yeah, it is really different", but the thing about prompting doesn't give me a good sense of what the connection is, if there is one.

OK. The issue with doing as you describe, from my perspective at least, is that, as I said, current practice is actually very similar to the human brain already, and the ways I can imagine making it more similar to the human brain seem to me like the sorts of things that wouldn't really matter that much, or that happen anyway — functionally equivalent things happen anyway in the current paradigm, maybe just needing more scale to fully manifest. And also, some of the ways the brain works are, I think, kind of bad from an alignment perspective. There's this thing the brain is capable of doing where it refactors your entire morality, or chooses a completely different moral philosophy, without producing any sort of external output about having done that.

When does the brain do that?

You can introspect on moral philosophy questions a bunch and decide, oh, I want to be a utilitarian hedonist, or whatever. And this seems quite bad, from my perspective, for an architecture. I think it's very good that GPT-4 only updates its parameters when we explicitly do backpropagation, on inputs and outputs that are visible to us.

OK.

So a lot of the ways that seem most relevant for making things more brain-like are either not that important, or actually bad from an alignment perspective. To a very great degree, I think the key controversial and alignment-relevant claim of shard theory is: yeah, actually, pretty simple things from deep learning are the way you want to go in order to get value alignment to work in a human-like manner. And once you update towards the shard theory perspective — that alignment is feasible in deep learning, and it sort of looks like RL plus self-supervised learning — you're in a pretty similar epistemic position to a lot of other alignment researchers trying to do prosaic alignment work, in terms of your judgment of what sorts of work are highest impact. My research direction is very much focused on stability over time: if you have an iteratively self-improving AI system that's curating its own training data, or going out into the world and running experiments and gathering more data to train on, or running ML experiments to refine its own architecture, and so on — how do you make sure this doesn't accidentally screw something up in its behavior or values? That's why my behavioral assessment framework focuses on things you didn't know you should be tracking, and on having assessments be very automated and scalable, so that we can run them continuously across multiple rounds of self-improvement.

OK, so roughly, the unifying theme is: given that something like self-supervised learning plus deep RL is roughly what we're going to want to do, we should do prosaic alignment research, but especially focus on
things that mimic the ways in which humans could be dangerous or misaligned, or just have big changes that we might be worried about.

Yeah, there's that, and of course the standard prioritization stuff of thinking about the threat model, and which research directions are highest-leverage for intervening on the riskiest types of threats, and so on. I'd add that shard theory does make me less concerned about, say, inner-misalignment-style concerns in the context of deep learning, and also makes me more dubious of the convergent-misaligned-mesa-optimizer perspective on what deep learning does at the limits of high capabilities and scale.

When you say it makes you less concerned about inner optimization, is that just because you think, yeah, it is going to have inner optimization, just like humans, and that's good?

Partially, and also partially because the inner optimization of humans feels like it has a different flavor compared to the inner optimization that "Risks from Learned Optimization" talks about. "Risks from Learned Optimization" is talking about: the AI will have some, maybe arbitrary, value for the long-term future, and it wants to pursue that, but it's aware it's in a training process with certain types of loss signals, so it needs to instrumentally behave in accordance with those loss signals, so as to avoid being updated away from its current goal.

Right — and from that perspective, the instrumental thing to do is to try to get as much reward as possible, so that your values remain the same.

Yep. But if you think the human brain's learning dynamics are a worthwhile analogy for deep learning dynamics, this immediately runs into problems. Suppose you want to maximize the number of geese in the world — that's your inner value — but your genetically specified reward circuitry has other sources of reward, so you should instrumentally pursue the most rewarding experiences available to you. That just seems to lead to a bad result in terms of the geese values: if you do a bunch of cocaine because you want to maintain value stability, I feel like you've made a mistake somewhere. And generally, I don't buy the story where, say, your linguistic cortex could become deceptively misaligned to the objective of predicting future language — where really it wants to tile the universe with, I don't know, frog statues or whatever, but because it's a cortex trapped in a brain, it just pretends: a hidden intelligence within your linguistic cortex that predicts the next linguistic experience, or whatever the exact loss signal is, instrumentally, in order to achieve downstream ends.

So, if listeners are interested in following your research, or the research of shard theory people, how should they do that?

Yeah, there are three locations I direct people to. One is my LessWrong profile, and Alex Turner's LessWrong profile, with periodic updates on our thoughts and research directions as LessWrong posts. The second location is the Shard Theory Discord, which I suppose we'll share a link to for the podcast. And the third location is the EleutherAI Discord — in particular, there's a channel there, I think it's literally called "eliciting latent knowledge", where Nora Belrose does her research on the
hidden knowledge within neural networks.

All right, yes — links to all of those will be in the description. We've talked for quite a while, and you've been very generous with your time, but hopefully our listeners will be able to understand the shard theory perspective on things a bit better. So thanks very much for being here today.

All right, thank you very much for having me. Glad to participate.

This episode is edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, as well as patrons such as Tor Barstad and Ben Weinstein-Raun. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

[Music]

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread


Spectrum trail (transcript): Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread


Spectrum trail (transcript): Med 0 · avg -5 · 133 segs

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread


Spectrum trail (transcript): Med 0 · avg -4 · 72 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread


Spectrum trail (transcript): Med -6 · avg -7 · 120 segs
