
AXRP · Civilisational risk and strategy

Singular Learning Theory with Daniel Murfet

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core safety topics through the lens of Singular Learning Theory with Daniel Murfet, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 119 full-transcript segments: median 0 · mean -3 · spread -200 (p10–p90 -100) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.

Slice bands
119 slices · p10–p90 -100

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 119 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video hdB9gIwD6x4 · stored Apr 2, 2026 · 3,587 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/singular-learning-theory-with-daniel-murfet.json when you have a listen-based summary.

Hello, everybody. In this episode I'll be speaking with Daniel Murfet, a researcher at the University of Melbourne studying singular learning theory. For links to what we're discussing, you can check the description of this episode, and you can read the transcript at axrp.net.

Alright, well, welcome to AXRP.

Yeah, thanks a lot.

Cool. So I guess we're going to be talking about singular learning theory a lot during this podcast. So: what is singular learning theory?

Singular learning theory is a subject in mathematics. You could think of it as a mathematical theory of Bayesian statistics that's sufficiently general, with sufficiently weak hypotheses, to actually say non-trivial things about neural networks, which has been a problem for some approaches you might call classical statistical learning theory. It's a subject that's been developed by a Japanese mathematician, Sumio Watanabe, and his students and collaborators over the last 20 years, and we've been looking at it for three or four years now, trying to see what it can say about deep learning in the first instance and, more recently, alignment.

Sure. So what's the difference between singular learning theory and classical statistical learning theory that makes it more relevant to deep learning?

The "singular" in singular learning theory refers to a certain property of the class of models. In statistical learning theory you typically have several mathematical objects involved: one would be a space of parameters, and then for each parameter you have a probability distribution (the model) over some other space, and you have a true distribution which you're attempting to model with that pair of parameters and models. In regular statistical learning theory you have some important hypotheses. Firstly, that the map from parameters to models is injective; and secondly, quite similar but technically a little distinct, that if you vary the parameter infinitesimally, the probability distribution it parameterizes also changes. This is technically the non-degeneracy of the Fisher information metric. Together these two conditions basically say that changing the parameter changes the distribution, changes the model. Those two conditions sit inside many of the major theorems you'll see when you learn statistics: things like the Cramér–Rao bound, and asymptotic normality, which describes the fact that as you take more samples, your model tends to concentrate, in a way that looks like a Gaussian distribution, around the most likely parameter. So these are basic ingredients in understanding how learning works in these kinds of parameterized models.
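For concreteness, the two regularity hypotheses can be written as follows (standard notation, not from the episode): injectivity of the map $w \mapsto p(\cdot \mid w)$, and positive-definiteness of the Fisher information

$$
I(w) \;=\; \mathbb{E}_{x \sim p(x \mid w)}\!\left[ \nabla_w \log p(x \mid w)\, \nabla_w \log p(x \mid w)^{\top} \right] \;\succ\; 0 .
$$

When both hold, asymptotic normality says the posterior concentrates around the true parameter $w_0$ roughly like $\mathcal{N}\!\big(w_0,\ (n\,I(w_0))^{-1}\big)$ as the sample size $n$ grows; "singular" means at least one of these hypotheses fails.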
But those hypotheses do not hold for neural networks, and it's quite easy to see why (I can go into more detail about that), so the theorems just don't hold. You can attempt to make use of some of these ideas anyway, but if you want a thoroughgoing, deep theory that is Bayesian and describes the learning process, the Bayesian learning process, for neural networks, then you have to be proving theorems in the generality that singular learning theory does. The "singular" refers to the breaking of these hypotheses: the fact that the map from parameters to models is not injective means, in combination with this other statement about the Fisher information metric, that if you start at a neural network parameter, there will always be directions in which you can vary that parameter without changing the input-output behavior, without changing the model. Some of those directions are kind of boring, some of them are interesting, but that's what singular learning theory is about: accommodating that phenomenon within the space of neural networks.

Yeah. And I guess the way I'd understood it is that this basically comes down to symmetries in the neural network landscape: you can maybe scale down this neuron and scale up that neuron, and if the neurons are the same it doesn't matter. But not only are there symmetries, there are non-generic symmetries, correct? Because if there were just some symmetries, maybe you could mod out by the symmetries, and then if you looked at the normal directions to the space along which you could vary things, maybe that would be fine. But the way I've understood it is that there are certain parameter settings for neural networks where you can change it one way, or you can change it another way, but you can't change it in both directions at once; and there are other parameter settings where you can only change it in one of those two ways. So it's not a smooth manifold, and it differs from place to place: it's not this kind of generic thing over the whole space. Some models are more symmetric than others, and that ends up mattering. Is that right?

I would say that's mostly correct. I would say the word "symmetry" is really not sufficient; I mean, at a high level I would also use this word, to a first approximation, in explaining what's going on, but it's really not a sufficient concept. But yeah, it's good to distinguish the kind of boring, generic symmetries that come from the nonlinearities. In some sense, that's why you can just look at a neural network and know that it's singular: because of these symmetries. With the ReLU, scaling up the input weights and scaling down the output weights respectively will not change the behavior of the network, so that's an obvious scaling symmetry, and it means the model is degenerate and therefore singular. But if that were all there was, then I agree: that's a sort of boring technical thing that, from the point of view of understanding the real phenomena, you wouldn't need to care about too much. The reason SLT is interesting is that, as you say, different regions of parameter space, you could say, have different kinds of symmetries, as a reflection of the qualitatively different ways in which they're attempting to model the true distribution. But this other thing you mentioned, about being able to move in different directions: that's not really symmetry so much as degeneracy. We could go more into, conceptually, why different regions or different kinds of solutions might have different kinds of degeneracy, but at a high level that's right. Different kinds of solutions have different kinds of degeneracy, and being able to talk about different kinds of degeneracy, how they trade off against one another, and why Bayesian learning might prefer more or less degenerate kinds of models is the heart of SLT.
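A minimal numeric check of the ReLU rescaling symmetry just mentioned; the toy two-layer network and the value of `alpha` are my own illustrative choices, not from the episode:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
U = rng.normal(size=(16, 4))   # input weights
V = rng.normal(size=(1, 16))   # output weights
x = rng.normal(size=(4, 100))  # a batch of inputs

alpha = 3.7  # any alpha > 0 works
f_original = V @ relu(U @ x)
f_rescaled = (V / alpha) @ relu((alpha * U) @ x)  # relu(a*z) = a*relu(z) for a > 0

# The two parameter settings define the same input-output map, so the map
# from parameters to models is not injective: the model is singular.
print(np.allclose(f_original, f_rescaled))  # True
```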
Sure. So I guess before we go into that: what do you mean by degeneracy?

Degeneracy just refers to this failure of the map from parameters to models to be injective. A degeneracy would just mean a particular kind of way in which you could vary the neural network parameter, say, such that the input-output map doesn't change. And as you were just mentioning, you might have, at one point, two or more essentially different ways in which you could vary the parameter without changing the loss function. And that is, by definition, what geometry is: what I'm describing there with my hand is the level set of the loss function. It might be the minimal level set or some other level set, but if we're talking about multiple ways I can change the neural network parameter without changing the loss, then I'm describing the configuration of different pieces of the level set of the loss function at that point, and that's what geometry is about.

Sure. So you mentioned that singular learning theory (or SLT for short) is very interested in different kinds of degeneracies. Can you tell us a little about that? What different kinds of degeneracies might we see, maybe, in deep learning, and why does the difference matter?

I think it's easier to start with a case that isn't deep learning, if that's all right. Deep learning jumps straight into the deep end, and it's also the thing we understand least, perhaps. First, when I say "loss function" I typically mean the population loss: not the empirical loss from a fixed data set of finite size, but the average of that over all data sets. That's the theoretical object whose geometry matters here; I'll flag that there are some interesting subtleties there. In a typical regular statistical setting (not neural networks, but linear regression or something), the population loss looks like a sum of squares, just a quadratic form, maybe with some coefficients, so the level sets are ellipses, and the learning process just looks like moving down that potential well to the global minimum. In that case there's no degeneracy: there's just one global minimum, and you can't vary the parameter at all and still have zero loss. A more interesting case: suppose you have ten variables but a sum of eight squares, x1 squared through x8 squared. If you minimize that, you've still got two free parameters, so there's a two-dimensional space of global minima of that function. Now imagine a population loss which (let's only care about local minima) has many local minima at various heights of the loss, each of which uses different numbers of variables. Suppose, for instance, that the global minimum uses all ten, but there's a level set a bit higher than that which uses only nine squares, and a level set a bit higher still which uses only eight. Then those have different amounts of degeneracy. So you have different points in the loss landscape where local minima have different degrees of degeneracy, and you can think about the competition between them as trading off a preference for degeneracy against a preference for loss. And then we're getting into the key questions of what the Bayesian posterior, what, if you're a Bayesian, what kind of solution you prefer in terms of accuracy versus degeneracy.
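Concretely, the example is (writing the ten parameters as $x_1, \dots, x_{10}$):

$$
L(x_1,\dots,x_{10}) \;=\; x_1^2 + \cdots + x_8^2 ,
$$

whose zero set is the two-dimensional plane $x_1 = \cdots = x_8 = 0$: the minimum is degenerate, with two free directions. With all ten squares present, the minimum would instead be the single point at the origin.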
And I guess this gets to this object that people talk about in singular learning theory called the learning coefficient. Can you tell us a little about what the learning coefficient is?

In the case I was just describing, it's easy to say what the learning coefficient is. There's a distinction between a global learning coefficient and a local one. What I'm about to say (I mean, everything I say about SLT) is more or less material that was introduced by Watanabe and written about in his books; at some point I guess we'll talk about our more recent contributions, but mostly what I'm describing is not my own work, just to be clear. I'll mostly talk about the local learning coefficient, which is a measure of degeneracy near a point in parameter space. Take the example I was just sketching out: imagine the global minimum level set and then some higher level sets, and I said the population loss near the global minimum looked like a sum of ten squares. The local learning coefficient there would just be 10 divided by 2: a half times the number of squares you used. If there was a level set that used only eight squares, then that's degenerate, because you have two free directions; it's not a single isolated minimum but rather a two-dimensional plane of minima, and each point of that plane, because it locally looks like a sum of eight squares, would have 8/2 as its local learning coefficient, and so on. If you use d′ squares in the local expression of your population loss, then your local learning coefficient is d′/2. That's not how it's defined (it has a definition we could get into, and various different ways of looking at it), but that's what it cashes out to in these examples.

Sure. And I guess the way to think about this local learning coefficient is that when it's lower, that's a solution that's more degenerate; and the way I gather Bayesian inference works is that it tries to have both a low loss and also a low local learning coefficient. Does that sound right?

Yep, that's right.

Sure. An image I often see in discussions of singular learning theory is people drawing doodles of trefoils and figure-eights, and maybe a circle thrown in there. And the story I often hear as a caricature is: initially you stay around the trefoil for a while (this is where you put your posterior mass), until at some point you get enough data and you start preferring the figure-eight, and then you get even more data and you start preferring the circle, which has maybe even lower loss. So as you go down, you get better loss, let's say, but the local learning coefficient is going to increase and therefore get worse.

Maybe I'll spell that out a little: the local learning coefficient is increasing, so you're accepting a more complex solution in exchange for it being more accurate.

Yeah. So I guess that's the very basic idea of singular learning theory.
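One compact way to write this accuracy-versus-complexity trade-off is via the free-energy asymptotics that come up again later in the conversation (my paraphrase of Watanabe's expansion, not a formula quoted in the episode): for a neighbourhood $W$ of a solution with loss $L_W$ and local learning coefficient $\lambda_W$,

$$
-\log P(W \mid D_n) \;\approx\; n\,L_W \;+\; \lambda_W \log n \;+\; O(1),
$$

so at small $n$ the posterior favours low-$\lambda$ (more degenerate) regions, while as $n$ grows the $n\,L_W$ term dominates and lower-loss regions win: the trefoil → figure-eight → circle story.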
Why does it matter? What are the important differences between the singular learning theory picture and the classical statistical learning theory picture?

In what context: statistical learning theory in general, deep learning theory, alignment, or all three in that order?

Maybe all three in that order. I think I want to put off the discussion of alignment relevance until a bit later, until we understand what's even going on with this whole thing.

Okay. I didn't actually come back to your earlier question about the local learning coefficient in neural networks, but I think the cartoon in terms of sums of squares might suffice for the moment. So, if we talk about statistical learning theory and machine learning or deep learning in general: I think the main high-level conceptual takeaway from singular learning theory, when you first encounter it, should be that the learning process in Bayesian statistics really is very different for singular models. Let me define what I mean by "learning process". In deep learning we tend to mean training by stochastic gradient descent; what I'm saying is maybe related to that, but that's a tricky point, so let me be clear that in Bayesian statistics the learning process refers to how, as you see more data, you change your opinion about the relative likelihood of different parameters. You see more data; some parameters become ruled out by that data, because they don't give that data high probability, whereas other parameters become more likely. What I'm describing is the Bayesian posterior, which assigns a probability to each parameter according to the data. When you've seen very few samples you really have no idea which parameters are correct, so the posterior is very diffuse and will change a lot as you see more samples, because you're just very ignorant. Asymptotic normality in regular statistical learning theory says that as you see more samples, that process starts to become more regular and concentrates around the true parameter in a way that looks like a Gaussian distribution; in some sense, a very simple process. But in singular models, that is not what happens; at least, it's not what's predicted to happen by the theory. Until relatively recently I think we didn't have many very compelling examples of this in practice, but what the theory says is what you were describing earlier: the Bayesian posterior should kind of jump as the trade-off between accuracy and complexity changes, which is a function of the number of samples, and those jumps move you from regions of qualitatively different solutions to other kinds of solutions, and then eventually, maybe asymptotically, to choosing among perfect solutions depending on their complexity, and so on. So there's a very complicated, not very well understood process underlying learning in Bayesian statistics for singular models, which, as far as I know, Watanabe and his collaborators are the only people to have really studied. Despite being somewhat old, in the sense that Watanabe and his students and collaborators have been working on it for a while, it's really not been studied in great depth outside of their group. So: a very fundamental process in Bayesian statistics, relatively understudied, but arguably (at least if you take a Bayesian perspective) very central to how learning works in, say, neural networks, whether they're artificial ones or even possibly biological ones. I think that's the main thing. It's not the only theoretical content of singular learning theory, but it's the main thing I'd want someone to know about the theory as it stands right now. The other thing is how that relates to generalization, but maybe I'll pause there.
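A toy numeric illustration (my own construction, not from the episode) of how different the posterior's behavior is in a singular model: for the regular model $y \sim \mathcal{N}(w, 1)$ the posterior width shrinks like $n^{-1/2}$, while for the singular model $y \sim \mathcal{N}(w^2, 1)$ (where $w \mapsto w^2$ is degenerate at the true parameter $w_0 = 0$) it shrinks more slowly, roughly like $n^{-1/4}$, under a flat prior on the grid:

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.linspace(-2.0, 2.0, 4001)  # parameter grid; flat prior on this interval

def posterior_sd(m_grid, y):
    """Posterior standard deviation of w on the grid for the model y ~ N(m(w), 1)."""
    n = len(y)
    # log-likelihood via sufficient statistics: sum_i -(y_i - m)^2 / 2
    ll = -0.5 * (n * m_grid**2 - 2.0 * m_grid * y.sum() + (y**2).sum())
    p = np.exp(ll - ll.max())
    p /= np.trapz(p, w)
    mean = np.trapz(w * p, w)
    return np.sqrt(np.trapz((w - mean) ** 2 * p, w))

for n in [10, 100, 1000, 10000]:
    y = rng.normal(0.0, 1.0, size=n)     # true distribution has w0 = 0
    sd_regular = posterior_sd(w, y)      # regular model:  y ~ N(w, 1)
    sd_singular = posterior_sd(w**2, y)  # singular model: y ~ N(w^2, 1)
    print(f"n={n:>6}  regular sd ~ {sd_regular:.3f}  singular sd ~ {sd_singular:.3f}")
```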
Sure. Maybe we should talk about that a bit. I hear people talk about this with the language of phase transitions, and upon hearing this, people might say: okay, if you look at loss curves of big neural nets being trained on language model data, the loss kind of goes down over time; it doesn't appear to be stuck at one level, then suddenly jump down to another level, then be flat, then suddenly jump down again. We have things that kind of look like that in toy settings, like grokking, like the development of induction heads, but it doesn't generically happen. So should we think of these phase transitions as being relevant to actual deep learning, or are they just a theoretical curiosity about the Bayesian theory?

I think that's a very reasonable question. Kind of a year ago we ourselves were skeptical on this front; even in toy settings it wasn't very clear that this theoretical prediction bears out. So maybe I'll spend a moment to be quite precise about the relationship between theory and practice in this particular place. What the theory says is that, asymptotically in n (the number of samples), a certain formula describing the posterior holds, and based on this formula you can form the expectation that phase transitions happen. But in principle you don't know the lower-order terms in the asymptotic, and there could be all sorts of shenanigans going on that mean this phenomenon doesn't actually occur in real systems, even toy ones. Theory on its own, in physics or machine learning or whatever, has its limits, because you can't understand every ingredient in an asymptotic expansion. So even in toy settings it was reasonable, I think, to have some skepticism about how common or important this phenomenon was, even if the theory is quite beautiful. Okay, so that aside: you go and look in toy systems, and you see this behavior, as we did. Then I think it's reasonable to ask: well, maybe this happens in small systems but not in large systems; and indeed in learning curves we don't think we see a lot of structure. So I'll tell you what we know, and then what I think is going on. I should preface this by saying that we actually don't know the answer to this question. I think it still remains unclear whether this prediction about phases and phase transitions is actually relevant to very large models; we're not certain about that. I would say there's a reasonable case for thinking it is relevant, but I want to be clear about what we know and don't know. And again, this is kind of an empirical question, because as for the theoretical situation under which phases and phase transitions exist, the theory sort of stops at some point and doesn't say much at the moment about this scale or that scale.
Okay. So what we know is that if you look at transformers around the scale of three million parameters, trained on language model data sets, you do see something like phases and phase transitions, which describe (again, what I'm about to describe is) the learning process of training, rather than of seeing more samples. The theoretical jump we're making here is to say: if the theory says there should be qualitative changes in the way the posterior describes which models are probable, if there are qualitative changes in that over the course of the Bayesian learning process as you see more samples, then you might expect something similar when you look at seeing cumulatively more examples through the training process of stochastic gradient descent. But that is not a theoretically justified step at this point, in any rigorous sense; it's the kind of prediction you might make assuming some similarity between the learning processes, and then you can go in empirically and see if it's true. Alright, so if we go and look at language models at the scale of three million parameters (this was a recent paper we did, "The Developmental Landscape of In-Context Learning"), what you see is that the training process is divided into four or five stages, which have different qualitative content, in a way that mostly isn't visible in the loss curve.

It is a little bit visible.

Yeah, I would agree with that; I mean, to the same extent that the induction bump is sort of visible in the original in-context learning and induction heads paper.

I mean, it's not obvious from the loss curve. It's not like everybody already knew all the things you found out.

Yeah. I would say that without these other results, if you looked at the loss curve and tried to tell a story about these little bumps, it would feel like tea-leaf reading; but once you know the stages are there, yes, you can look at the loss curve and believe in certain features of it.

I think that's fair.

Sure. There are various details about how you think of the relationship between those stages and phases and phase transitions in the SLT sense, but I would say that's a still very small model (though not a toy model) in which you do see something like stage-wise development. And there are independent reasons: people have independently been talking about stage-wise development in learning systems outside of SLT. So the SLT story, and stage-wise development as a general framing for how structure arrives inside self-organizing learning processes, dovetail pretty well. Back to your question about structure in the loss curve: just because nothing's happening in the loss curve doesn't mean there isn't structure arriving in stages within the model. And our preliminary results on GPT-2 small, at about 160 million parameters, suggest that at a high level it has stages that look pretty similar to the ones in the three-million-parameter model.

Interesting.

So, okay, here's my guess for what's going on. It's true that in very large models the system is learning many things simultaneously, so you won't see very sharp transitions, except possibly if they're very global things.
Switching to in-context learning as a mode of learning, for example, seems like it affects most of the things a system is learning, right? So a qualitative change at that scale, you might guess, is represented at the highest level and might even be visible in the loss curve, in the sense that everything is coordinated around it: there's a before and after. But for many other structures you might learn, while they're developing, somewhere else the model is memorizing, you know, the names of US presidents or something, which has nothing to do with structure X, Y, Z. So in some sense the loss curve can't possibly hit a plateau: even if it's hitting a critical point for these other structures X, Y, Z, it's steadily making progress memorizing the US presidents, so it can't cleanly plateau. So the hypothesis has to be something like: if there is stage-wise development reflected by these phases and phase transitions, it's in some sense or another localized, maybe localized to subsets of the weights, and maybe localized, in some sense, to certain parts of the data distribution. The global phases or phase changes, which touch every part of the model and affect every kind of input, are probably relatively rare; but that isn't the only kind of phase, phase transition, or stage to which Bayesian statistics or SLT could apply.

Sure. Should I imagine these as being sort of singularities in a subspace of the model parameter space? Like, the learning coefficient kind of picks them out in this subspace, but maybe not in the whole parameter space?

That's kind of what we're thinking. These questions are pushing into areas that we don't understand, I would say, so I can speculate, but I want to be clear which parts of this we're rather certain of. The mathematical theory is very solid. The observed correspondence between the theory and Bayesian phase transitions in toy models is empirically and theoretically quite solid. This question of what's happening in very large systems is a deep and difficult question; these are hard questions. But I'll say that, yes, I think that's right, and it's the motivation for one of the things we're currently doing, what we call the weight-restricted local learning coefficient. This basically means you take one part of the model, say a particular head, and you freeze all the other weights. Let me give a more formal setting: when we're talking about the posterior and the local learning coefficient and so on, we imagine a space of parameters, with d dimensions or something, and some of those directions in parameter space belong to a particular head. I want to take a parameter that, at some point in training, has some values for all these different weights, freeze all but the ones in the head, and then treat that as a new model. So now my model is: I'm not allowed to change the frozen weights, but I am allowed to change the weights involved in the head. I can think about the Bayesian posterior for that model, and I can talk about its local learning coefficient; and that involves perturbing the parameter near that particular point, but in a way where you only perturb the weights involved in that part of the structure, say that head.
You can define the local learning coefficient of that restricted model, and that's what we call the weight-restricted local learning coefficient. The hypothesis would be that if a particular part of the model is specializing in particular kinds of structure, and that structure is developing, then you'll be at a critical point for some kind of restricted loss referring only to those weights, and that would show up. I mean, we haven't talked about how the local learning coefficient is used to talk about phase transitions, but that's the experimental way in which you'd attempt to probe whether some part of the model is doing something interesting (undergoing a phase transition) separately from other parts of the model.
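A minimal sketch of the weight-restriction move, assuming a PyTorch model; `weight_restricted_llc` and its interface are my own stand-ins, and `estimate_llc` is any LLC estimator that leaves `requires_grad=False` parameters untouched (one possible such estimator is sketched after the next exchange):

```python
import torch

def weight_restricted_llc(model, component_param_names, estimate_llc, loss_fn, batches, n):
    """Freeze every weight except those of one component (say, one attention head),
    then estimate the LLC of the resulting restricted model."""
    for name, p in model.named_parameters():
        # Only the chosen component's weights count as parameters of the new model;
        # everything else is held fixed at its trained value.
        p.requires_grad_(name in component_param_names)
    return estimate_llc(model, loss_fn, batches, n)
```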
Yeah, actually, maybe we should clarify that: how do you use the local learning coefficient to figure out whether a phase transition is happening?

It depends on your background which answer to this question is most pleasant. For physics-y people who know about free energy: they're familiar with the idea that various derivatives of the free energy should do something discontinuous at a phase transition, and you can think of the local learning coefficient as being something like that. If there is a phase transition, you might expect this number to change rapidly, relative to the way it usually changes. But if we just stick within a statistical-learning-theory frame: we were laying out this picture earlier of the Bayesian posterior, as you see more samples, concentrating in some region of parameter space and then rapidly shifting to be concentrated somewhere else. The local learning coefficient is a statistic of samples from the Bayesian posterior, so if the posterior shifts, this number will also shift. The expectation would be that if you measure this number (and it turns out you can, in experiments) and you see it change in some significant way, then that's perhaps evidence that some qualitative change in the posterior has occurred. So that's a way of detecting phase transitions which, if you take this bridge from Bayesian statistics to statistical physics, is pretty well justified, I would say.

Sure. And a question about that: my understanding is that actually measuring the local learning coefficient involves taking a parameter setting, looking at a bunch of parameter settings nearby along all the dimensions you could vary it, and measuring a bunch of properties. That's the kind of thing that's easy to do when you have a very low-dimensional parameter space, corresponding to a small number of parameters; it seems like it's going to be harder with a higher number of parameters in your neural networks. Just practically: how large a model can you efficiently measure the local learning coefficient of at this time?

That's a good question. It's tricky, so maybe this will be a bit of an extended answer, but I think it'll be better if I provide some context.

Sure.

When we first started looking at SLT (myself and my colleague here at the University of Melbourne, Susan Wei, and some other people), this was before... I mean, believe it or not, today there are like 10x the number of people interested in SLT than there were back when we started thinking about it. It was an extremely niche subject: very deep and beautiful, but somewhat neglected. And our question at that time was exactly this question. The theory says the learning coefficient (the real log canonical threshold is another mathematical name for it) is a very interesting invariant, but it was very unclear whether you could accurately estimate it in larger models. A lot of the theoretical development was like: you use one PhD student to compute the RLCT of one model, theoretically, and you need some hardcore algebraic geometry to do that, et cetera. So, the way the subject sat, it wasn't clear that you could really be doing this at scale, because it seems to depend on having very accurate samples from the posterior via Markov chain Monte Carlo sampling or something. I admit I was actually extremely pessimistic, when we first started looking at it, that there would really be a future in which we'd be estimating RLCTs, or local learning coefficients, of 100-million-parameter models. Okay, so that's where I started from. My colleague Susan and my PhD student Edmund Lau decided to try SGLD (stochastic gradient Langevin dynamics, an approximate Bayesian sampling procedure based on using gradients) and to see how it worked. There's a step in estimating the local learning coefficient where you need samples from the posterior, and as you were describing, this is famously difficult for high-dimensional, complex models. However, there is a possible loophole. I don't believe that anybody has a technique, nor probably ever will, for modeling very accurately the Bayesian posterior of very-large-scale models like neural networks; I don't think this is within scope, and I'm skeptical of anybody who pretends to have a method for doing that. Hence why I was pessimistic about estimating the LLC at scale: it's an invariant of the Bayesian posterior, which seems to require a lot of information about the posterior, and I believe that information is hard to acquire. But the potential loophole is that maybe the local learning coefficient relies on relatively robust signals in the Bayesian posterior that are comparatively easy to extract, compared to knowing all the structure. And that seems to be the world we're in. So, to answer your question: Zach Furman and Edmund Lau just recently had a preprint out where, using SGLD, it seems you can get relatively accurate estimates of the local learning coefficient for deep linear networks: just products of matrices, no nonlinearities. It scales up to 100 million parameters.

A hundred million, with an "m"?

With an M, yeah.

Okay.

Now, one should caveat that in several ways.

Yeah, okay. And am I right that this is distinct from the "Quantifying degeneracy with the local learning coefficient" paper?

That's right; this is a second paper, kind of a follow-up to that. I forget the title; I think it's "Estimating the Local Learning Coefficient at Scale". We wrote a paper a couple of years ago now, defining the local learning coefficient (which is kind of implicit in Watanabe's work, but we made it explicit), making the observation that you could use approximate sampling to estimate it, and then studying that in some simple settings. But it remained very unclear how accurate that was in larger models.
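A rough sketch of an SGLD-based estimator in the spirit of the "Quantifying degeneracy" recipe as I read it, using the form lambda_hat = n·beta·(E[L] − L(w*)) with samples drawn from a tempered posterior localized at the trained parameter w*. All hyperparameters (steps, eps, gamma, beta = 1/log n) are illustrative guesses, not recommendations:

```python
import copy
import math
import torch

def estimate_llc(model, loss_fn, batches, n, steps=2000, eps=1e-5, gamma=100.0):
    """Sketch: lambda_hat = n * beta * (E_sgld[L] - L(w*))."""
    beta = 1.0 / math.log(n)  # inverse temperature, ~1/log(n) as in the WBIC
    w_star = [p.detach().clone() for p in model.parameters()]

    with torch.no_grad():  # average loss at the centre, L_n(w*)
        loss_star = sum(loss_fn(model(x), y).item() for x, y in batches) / len(batches)

    sampler = copy.deepcopy(model)  # chain state, initialised at w*
    traced = []
    for step in range(steps):
        x, y = batches[step % len(batches)]
        loss = loss_fn(sampler(x), y)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                if not p.requires_grad:  # frozen weights (e.g. weight-restricted LLC) stay put
                    continue
                drift = beta * n * p.grad + gamma * (p - p0)  # localised, tempered log-posterior gradient
                p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
        traced.append(loss.item())

    mean_loss = sum(traced[steps // 2:]) / (steps - steps // 2)  # crude burn-in: keep second half
    return n * beta * (mean_loss - loss_star)
```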
Now, the reason it's difficult to go and test that is because we don't know the true local learning coefficient for very many models that can be scaled up in some direction. We know it for one-hidden-layer tanh networks and things like that, but some recent, very deep, interesting work by Professor Miki Aoyagi gives us the true value of the local learning coefficient for deep linear networks, which is why Zach and Edmund studied those: it was an opportunity to see whether SGLD is garbage or not for this purpose. I should flag (how should I say this) that SGLD is a very well-known technique for approximate Bayesian posterior sampling, and I think everybody understands that you should be skeptical of how good those posterior samples are, in some sense. It might be useful for some purposes, but you shouldn't really view it as a universal solvent for your Bayesian-posterior-sampling needs or something; just using SGLD doesn't magically mean it's going to work. So I would view it as quite surprising to me that it actually gives accurate estimates at scale for deep linear networks. Now, having said that, deep linear networks are very special, and they are less degenerate in some important ways than real neural networks with nonlinearities, etc. So don't take me as saying that we know that local learning coefficient estimation gives accurate values of the local learning coefficient for language models or something; we have basically no idea about that. But we know it's accurate in deep linear networks. So, okay, what is generalizable about that observation? I think it leads us to believe that SGLD is actually not garbage for estimating the LLC. How good it is, we still don't know, but maybe this cheap posterior sampling is still good enough to get you something interesting. And the other thing is that what you observe, in cases where you know the true values, is that when the model undergoes phase transitions (which exist in deep linear networks; stage-wise development in deep linear networks has been studied for quite a long time, maybe not in those exact terms), you can see that this local learning coefficient estimator, which is measuring the complexity of the current parameter during the learning process, does jump in the way you would expect at a phase transition when deep linear networks go through these transitions. And it had to, because we know theoretically what's happening to the geometry there. As for those jumps in the local learning coefficient in other models, like these three-million-parameter language models or GPT-2 small: when you go and estimate the local learning coefficient, you see it change in ways that are indicative of changes in internal structure. Now, we don't know that the absolute values are correct when we do that, and most likely they're not, but I think we believe in the changes in the local learning coefficient reflecting something real, to a greater degree than we believe in the absolute values being real. But still, I don't know how we would ever get to a point where we would know that local learning coefficient estimation was accurate in large models, absent really fundamental theoretical improvements that I don't see coming in the near term. But that's sort of where we are at the moment.

Fair enough.
So, a while back you mentioned the contributions of singular learning theory to understanding deep learning: there was something to do with phase transitions, and I think you also mentioned something to do with generalization. I want to ask you about that, especially in this context: I sometimes hear people say, "oh, singular learning theory says that model classes can have parameters with some degeneracy, and that basically reduces their effective parameter count, and this just explains how generalization is possible". This is the kind of story one can tell when one's feeling excitable, but it's a bit more complicated: it's going to depend on the details of how these parameters actually translate into functions, and what these degeneracies actually look like in terms of predictive models. So, what does singular learning theory tell us about generalization, particularly in the context of deep networks?

This is subtle. On its face, the theorems of singular learning theory describe relations between local loss-landscape geometry (this local learning coefficient) and generalization error in the Bayesian sense. In the Bayesian sense, what I mean by generalization error is the KL divergence between the true distribution and the predictive distribution. Maybe I should briefly say what the latter is. You're trying to make a prediction; if we're talking about a conditional distribution, a prediction of y given x. You look at all the parameters you've got for modeling that relationship: given an input, you take the prediction from every single model parameterized by your parameter space, you weight it by the probability given to that model by the Bayesian posterior, and you average them all in that way. That's the Bayesian predictive distribution. It's obviously radically intractable to use or find that object, so it's a theoretical object; and that probability distribution is probably not one that's parameterized by your parameter space, but you can cook it up out of the models parameterized by your parameters. The KL divergence between that and the truth is the Bayesian generalization error, the KL divergence just being a measure of how different two probability distributions are. Okay, so that seems like a very theoretical object. There's a closely related object, the Gibbs generalization error, which puts some expectations in different orders, and which is closer to what people in machine learning mean by test error: taking a parameter and trying it out on some samples from the true distribution that weren't used to produce that parameter. So there are various subtleties there. Strictly speaking, SLT only says things about those kinds of generalization errors, and as for the relationship between those and the test error for a parameter produced by a single run of SGD: well, I don't even know that that is a mathematical object, actually, test error for a parameter after a single run. But you can do things like talk about the expected test error over some distribution of SGD runs. And arguably, okay, there's a gap between that Bayesian story and what you mean by test error in deep learning.
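In symbols (my notation, consistent with the verbal definitions above): the predictive distribution averages model predictions over the posterior, and the Bayesian generalization error is its KL divergence from the true distribution $q$:

$$
p^{*}(y \mid x, D_n) \;=\; \int p(y \mid x, w)\, p(w \mid D_n)\, dw,
\qquad
G_n \;=\; \mathbb{E}_{x \sim q}\!\left[\, \mathrm{KL}\!\left( q(y \mid x) \,\middle\|\, p^{*}(y \mid x, D_n) \right) \right].
$$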
This gap hasn't been very systematically addressed, but I'll lay out some story about how you might eventually bridge it, in order to answer your question. If you believe that the Bayesian learning process ends with a distribution over parameters that looks something like the endpoints of SGD training, or at least close enough that this average over SGD runs of the test error looks a bit like averaging, over the Bayesian posterior, some generalization quantity that makes sense in the Bayesian theory, then you could maybe draw some connection between these two things. That hasn't been done. I don't know if it's true, because these questions about relations between the Bayesian posterior and SGD are very tricky, and I don't think they look like they're going to get solved soon, at least in my opinion. So, okay, there's a gap there; that's one gap. If we just paper over that gap and say, fine, let's accept that for the moment and treat the generalization error that SLT speaks about as the kind of generalization error that we care about: what does SLT say? Okay, maybe I'll insert one more comment about that relationship between test error in deep learning and Bayesian generalization error first. This is a bit of a tangent, but I think it's important to insert here. Various people, when looking to explain the inductive bias of stochastic gradient descent, have hit upon a phenomenon that happens in deep linear networks and similar systems, which is a kind of stage-wise learning where the model moves through complexity in an increasing way. In deep linear networks, or what's sometimes called matrix factorization, where you're trying to use a product of matrices to model a single linear transformation, people have observed that if you start with a small initialization, the model starts with low-rank approximations to the true linear transformation: it finds a pretty good low-rank approximation, then takes a step to use linear transformations of one higher rank, and so on, moving through the ranks in order to discover a good model. Now, if you believe that, then you would believe that if SGD training is doing that, it will tend to find the simplest solution that explains the data, because it searches starting with simpler ones and only goes to more complicated ones when it needs to. Okay, now theoretically, I think that's not rigorously known to happen even in deep linear networks, but there are expectations that it happens empirically, and there's some partial theory. And then it's a big leap to believe that for general SGD training of general neural networks; I think we really don't know that that's the case in general deep learning. But believing that is pretty similar to believing something about the Bayesian learning process moving through regions of parameter space in order of increasing complexity, as measured by the local learning coefficient. And in fact, that is exactly what's happening in the deep linear networks: the SLT story about moving through the parameter space, with the Bayesian posterior undergoing phase transitions, is exactly what's happening in the deep linear network. So if you're willing to buy that generalization from that corner of deep learning to the general behavior of neural networks, then I think you are, in some sense, already buying the SLT story, to some degree, of how learning is structured by looking for increasingly complex solutions.
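A small demo of the rank-stepwise learning described here (my own toy setup: an 8×8 target with well-separated singular values, fit by a two-matrix product from tiny initialization). With small initialization, the leading singular values of the product typically rise in stages toward 10, then 5, then 2, then 1, with plateaus in between:

```python
import numpy as np

rng = np.random.default_rng(0)
# A target linear map with well-separated singular values 10 > 5 > 2 > 1.
U, _ = np.linalg.qr(rng.normal(size=(8, 8)))
V, _ = np.linalg.qr(rng.normal(size=(8, 8)))
A = U @ np.diag([10.0, 5.0, 2.0, 1.0, 0, 0, 0, 0]) @ V.T

scale = 1e-4  # tiny initialization: the regime where rank-stepwise learning appears
W1 = scale * rng.normal(size=(8, 8))
W2 = scale * rng.normal(size=(8, 8))
lr = 0.01
for step in range(3001):
    E = W2 @ W1 - A                  # residual
    g2, g1 = E @ W1.T, W2.T @ E      # gradients of 0.5 * ||W2 W1 - A||_F^2
    W2 -= lr * g2
    W1 -= lr * g1
    if step % 250 == 0:
        # Watch the effective rank of W2 W1 grow stage by stage.
        print(step, np.round(np.linalg.svd(W2 @ W1, compute_uv=False)[:4], 2))
```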
Okay, but all of those are big question marks from a theoretical point of view, I would say. Putting that aside: what does SLT say about generalization? Well, it says that the asymptotic behavior of the generalization error, as a function of the number of samples (at the very end of training, let's say, or the very end of the Bayesian learning process), looks like the irreducible loss plus a term that looks like λ/n, where λ is the local learning coefficient. If you take that irreducible loss over to the other side: the difference between the generalization error and its minimum value is proportional to 1/n, and the constant of proportionality is the local learning coefficient. That's the deep role of this geometric invariant, this measure of complexity, in the description of generalization error in the Bayesian setting. Now, what that says in deep learning: well, as I said, taking that first part of the bridge between the two worlds for granted, it would like to say something like "the test error, when you're looking at a particular region of parameter space, is governed by the local learning coefficient", except that the relation between n and training is unclear, so the exact way in which it governs test error is a function of how that bridge gets resolved. So I think at a technical level it's difficult to say much that's precise at the moment. I don't think it's impossible; it's just that very few people are working on this, and it hasn't been getting enough attention to say more concrete things.
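In symbols, the asymptotic stated here is Watanabe's result for the Bayesian generalization error $G_n$ defined above:

$$
\mathbb{E}[G_n] \;=\; \frac{\lambda}{n} \;+\; o\!\left(\frac{1}{n}\right),
\qquad\text{equivalently}\qquad
\mathbb{E}[\text{test loss}] \;\approx\; L_0 + \frac{\lambda}{n},
$$

where $L_0$ is the irreducible loss and $\lambda$ the local learning coefficient.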
At a conceptual level (and this maybe starts to get into more interesting future work you can do taking the SLT perspective), this relationship between the local learning coefficient, how it's determined by loss-landscape geometry, and generalization behavior is a very interesting link, which I think is quite fundamental and interesting. But I think your question is sort of going in the direction of your LessWrong post, is that right?

That's sort of what it was inspired by, yeah. Just this question of: suppose we believe the story that we're gradually increasing complexity, as measured by the local learning coefficient in this model class. What does that actually say in terms of objects I cared about before I heard of singular learning theory? What's it telling me, in terms of things I care about, about the behavior of these things?

It could tell you things like this. Suppose you know two solutions of your problem that are qualitatively different: you have a data-generating process, and you can think about it in two different ways and therefore model it in two different ways. If you could estimate the local learning coefficient, or derive it, or have some method of knowing that one is lower than the other, it could tell you things like "one will be preferred by the Bayesian posterior". And to the extent that that is related to what SGD finds, that might tell you that training is more likely to prefer one class of solutions to the other. Now, if those parameters are just very different (completely different solutions, somehow not nearby in parameter space), maybe it's quite difficult to make the bridge between what the Bayesian posterior would prefer and what training will do, because in that case the relationship between training and these two parameters is a very global thing, to do with the trajectory of training over large parts of the parameter space, and perhaps very difficult to translate into a Bayesian setting. But I think in cases where you have two relatively similar solutions, maybe you had a choice to make: during the training process you had one of two ways to take the next step and accommodate some additional feature of the true distribution, and those two choices differed in complexity, in a way that could be measured by the local learning coefficient. Like, one was more complex but lowered the loss by this much, and the other was simpler but didn't lower the loss quite as much. Then you could make qualitative predictions for what the Bayesian posterior would prefer to do, and then you could ask whether those predictions are also what SGD does. Either you could theoretically try to find arguments for why that is true, or it gives you an empirical prediction you can go and test. And at least in some toy cases, in this toy model of superposition work we did, SGD training does kind of seem to do the thing the Bayesian posterior wants to do. That's very unclear in general, but it gives you reasonably grounded predictions that you might then go and test, which I think is not nothing. That would be, I think, the most grounded thing you could do with the current state of things.

Yeah. And I guess it suggests a research program of trying to understand which kinds of solutions have a lower learning coefficient and which kinds have higher learning coefficients, just giving you a different handle on the problem of understanding what neural network training is going to produce. Does that seem fair?

I think our perspective on a lot of these questions about the relation between theory and practice will shift once we get more empirical evidence. What I expect will happen is that these questions seem to loom rather large when we've got a lot of theory and not so much empirical evidence. But if we go out and study many systems, and we see local learning coefficients, or restricted local learning coefficients, doing very stage-wise things, and they correspond very nicely to the kind of structure that's developing (as we can test independently with other metrics), then I think it will start to seem a little bit academic whether or not it's provably the case that SGD training does the same thing as the Bayesian posterior. Because this tool (the local learning coefficient, if you look at the definition) has a sensible interpretation in terms of what happens to the loss as you perturb certain weights, and you can tell a story about it that doesn't rely on the link between the Bayesian posterior and SGD training. So, to the degree that the empirical work succeeds, I think people will probably take this independent justification, so to speak, of the LLC as an interesting quantity, and think about it as a reflection of what's happening to the internal structure of the model. And then the mathematicians like myself will still be happy to go off and try to prove these things are justified, but I don't see this as necessarily being a roadblock to using it quite extensively to study what's happening during training.
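The kind of back-of-envelope comparison this suggests for two competing solutions, reusing the free-energy asymptotics from earlier; all numbers are made up for illustration:

```python
import numpy as np

# Two hypothetical solutions: A is simpler but lossier, B is more complex but fits better.
L_A, lam_A = 0.120, 15.0   # loss and local learning coefficient of solution A
L_B, lam_B = 0.100, 40.0   # solution B

n = np.arange(10, 20001)
F_A = n * L_A + lam_A * np.log(n)   # free-energy asymptotics: n*L + lambda*log(n)
F_B = n * L_B + lam_B * np.log(n)

# The posterior prefers whichever region has lower free energy, so it
# switches from the simple solution A to the accurate solution B once
# n is large enough.
crossover = n[np.argmax(F_B < F_A)]
print(crossover)  # the sample size at which accuracy starts to beat simplicity
```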
Yeah, fair enough. So I guess I'd like to ask some questions about SLT compared to other potential theoretical approaches one could take to deep learning. The first comparison I have is to neural tangent kernel approaches. The neural tangent kernel, for listeners who don't know, is basically this observation that in the limit of infinitely wide neural networks, under a certain method of initializing networks, during training the parameters don't vary very much; and because the parameters don't vary very much, you can do a sort of mathematical trick, and it turns out that your learning is basically a type of kernel learning, which is essentially linear regression on a set of features. It luckily turns out to be an infinite set of features, and you can do it... I don't know how I was going to finish that sentence, but it turns out to be kernel learning on this set of features, and you can figure out what those features are supposed to be based on what your model looks like: how many and what kinds of nonlinearities you're using. And there's some family of theory trying to understand: what does the neural tangent kernel of various types of models look like? How close are we to the neural tangent kernel? And if you believe the neural tangent kernel story, you can say things like: the reason neural networks generalize is that the neural tangent kernel tends to learn certain kinds of features before other kinds of features, and maybe those kinds of features are simpler. It seems plausible that you could tell some story about phase transitions, and it's a mathematically rigorous story. So I'm wondering: how do you think the singular learning theory approach to understanding deep learning compares to the neural-tangent-kernel-style approach?

Good question. I think I'm not expert enough on the NTK to give a very thorough comparison, but I'll do my best. Let me first say the places where I understand the NTK to say very deep and interesting things. This work on the μP parametrization seems very successful. At initialization, taking the limit to infinite width is quite justified, because the weights really are independent. This seems like probably the principal success of deep learning theory, to the extent there are any successes: the study of that limit and how it allows you to choose hyperparameters, for learning rates and other things. Again, I'm not an expert, but that's my understanding of how it's used, and it seems to be quite widely used in practice. So that has been a great success of theory. I don't think I believe in statements outside of that initial phase of learning, though. As far as I understand it, the claims to applicability of NTK methods become hypotheses unless you then perturb away from the Gaussian process limit. The deep parts of that literature seem to me to accept the position that in the infinite-width limit you get some Gaussian process that isn't actually a good description of the training process away from initialization, but that you can then perturb back: basically, in the exponent of some distribution you can put in higher-order terms,
You study those correction terms systematically to get back from infinite width to finite width, and accommodate those contributions in some fashion. You can do that with tools from random matrix theory and Gaussian processes, and it looks a lot like what people do in Euclidean quantum field theory; people have been applying techniques from that world to do it, and I think they can say non-trivial things. But I think it is overselling it to say that this is a theory on the same level of mathematical rigor and depth as SLT. I don't think it says things about the Bayesian posterior and its asymptotics in the way SLT does; it's aiming at rather different statements, and, at least in my judgment at the moment, it has a little of the flavor of saying qualitative things rather than quantitative things. Again, this is my outsider's impression, and I could be wrong about the state of things there. One part of that story I have looked at a little is the work my colleague Liam Hodgkinson has done here: they have some very interesting recent work on information criteria in overparameterized models (I think the title is something like that), partly inspired by Watanabe's work, looking at taking this general point of view to do things like what the free energy formula in SLT does. I think that's quite interesting. I have my differences of opinion with Liam about some aspects of it, but mathematics isn't actually divided into camps that disagree with one another: if things are both true, then they meet somewhere. The geometric heart of SLT is made up of two pieces: one is using resolution of singularities to do Laplace-type oscillatory integrals, and the other is dealing with the empirical processes that intervene when you put that machinery into the context of statistics. I don't think those oscillatory-integral techniques have been used systematically by the people doing NTK-like or Euclidean-field-theory-like work, but if you took those techniques and used them in the context of the random matrix theory going on there, you'd probably find that the perturbative expansions they're doing can be linked up with SLT somewhere. So I think it all probably fits together eventually, but right now they're quite separated.
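As a toy illustration of the object at the center of this exchange, here is a small numpy sketch of the empirical tangent kernel at initialization (an editor's illustration: the network, its sizes, and the finite-difference Jacobian are all assumptions, and this shows only the finite-width kernel at initialization, not the infinite-width construction).

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer tanh network f(x; theta) with scalar output, just to
# exhibit the linearization behind the NTK story.
def init_params(d_in=3, d_h=64):
    W1 = rng.normal(size=(d_h, d_in)) / np.sqrt(d_in)
    w2 = rng.normal(size=d_h) / np.sqrt(d_h)
    return np.concatenate([W1.ravel(), w2])

def f(x, theta, d_in=3, d_h=64):
    W1 = theta[: d_h * d_in].reshape(d_h, d_in)
    w2 = theta[d_h * d_in :]
    return w2 @ np.tanh(W1 @ x)

def param_gradient(x, theta, eps=1e-6):
    # Finite-difference gradient of f(x; theta) in the parameters.
    base = f(x, theta)
    return np.array([(f(x, theta + eps * e) - base) / eps
                     for e in np.eye(theta.size)])

theta0 = init_params()
xs = [rng.normal(size=3) for _ in range(4)]
J = np.stack([param_gradient(x, theta0) for x in xs])

# Empirical tangent kernel at initialization: K[i, j] = <grad f(x_i), grad f(x_j)>.
# In the NTK regime, parameters barely move, so training reduces to kernel
# regression with this fixed kernel; feature learning would mean K changes.
K = J @ J.T
print(np.round(K, 3))
```

The kernel is fixed by the architecture and initialization; the debate above is about how far from initialization this linearized picture remains a good description.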
Fair enough. A related question: one observation from the little I know of the deep learning theory literature is that the variance of the distribution from which parameters are initialized matters. One example is deep linear models: if your initialization distribution has high enough variance, things look something like the NTK, and you only have a small distance to travel to the optimum; whereas if all the parameters are really close to zero at initialization, you get this jumping between saddle points. And in deep networks, at one initialization scale you have the neural tangent kernel story, which crucially doesn't really involve learning features (there's a fixed set of features and you decide which ones to use), and if you change the initialization variance you start doing feature learning, which seems qualitatively different. If I think about translating that into a singular learning theory story: when people tell Bayesian stories about gradient descent, they often think of the prior as the initialization distribution, and in the free energy formula of singular learning theory, where the loss comes up and then the learning coefficient, the prior comes in at an order-one term that matters not very much, basically, late in the process, late in training. So my question is: is singular learning theory going to have something to say about these initialization-distribution effects?

I haven't thought about it at all, so I'm really answering this off the cuff. From the asymptotic point of view we tend not to care about the prior, so this isn't a question we've thought about much so far. But look at our work on the toy model of superposition, where you can really at least try to estimate the order-n term in the asymptotic, the log n term, and then the lower-order terms. Maybe I should say what this asymptotic is. Take the Bayesian posterior probability assigned to a region of parameter space, and take negative its logarithm; lower values mean the region is more probable according to the posterior. You can give an asymptotic expansion for that in n: for large n it looks like n times a number which is roughly the average loss in that region, plus the local learning coefficient times log n, plus lower-order terms. The lower-order terms we don't understand very well, but there's definitely a constant-order term contributed by the integral of the prior over the region. Now, in the toy model of superposition, that constant-order term is not insignificant at the scale of n at which we run our experiments. So it does have an influence, and I can easily imagine that this accounts for the kind of phenomena you're describing in DLNs. A mathematician friend of mine, Simon Pepin Lehalleur, an algebraic geometer who has become somewhat SLT-pilled, has been looking at a lot of geometric questions in SLT and asked me about this at some point. I'd speculate that if you incorporated a constant term reflecting those differences in initialization, that would account for this kind of effect. Maybe later in the year we'll write a paper about DLNs; at the moment we don't have a complete understanding of the local learning coefficients of the level sets away from the global minimum. I think we're probably close to understanding that, but there's an obstacle to completely answering the question right now. In principle, though, I think the effect would be incorporated via the constant-order term.
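For reference, here is a hedged reconstruction of the asymptotic being described, in the style of Watanabe's free energy formula (notation is the editor's; the log log n term is the multiplicity term discussed just below):

```latex
-\log \int_{W} e^{-n L_n(w)}\,\varphi(w)\,dw
\;\approx\; n\,L_n(w^{*}) \;+\; \lambda \log n \;-\; (m-1)\log\log n \;+\; O_p(1),
```

where W is the region of parameter space, w* the most probable parameter in it, λ the local learning coefficient, m the multiplicity, and the O_p(1) term absorbs, among other things, the integral of the prior φ over W. Note the scales involved: even at n = 10^9, log n ≈ 20.7 while log log n ≈ 3.0, so a constant-order prior contribution can plausibly rival the lower-order terms at experimental sample sizes.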
Sure. And to be clear, that wouldn't change the behavior at very large n, but for a significant range of n, potentially including the values you typically look at in experiments, the constant-order term could bias the posterior toward some solutions over others in a way that explains the differences.

Yeah. And I guess there's also the point that in this expansion you have a term times n, a term times the logarithm of n, a term times the logarithm of the logarithm of n if I remember correctly, and then these constant things; and log log n is very small, right? So it seems relatively easy for the constant-order term to be more important than that, or potentially as important as log n.

Yeah, although that log log n term is very tricky. The multiplicity: Aoyagi's proof, as I said, covers deep linear networks, and in particular she understands the multiplicity, the coefficient of this log log n term, up to a minus one. If I remember correctly, as a function of the depth it has a kind of bouncing behavior, with larger and larger bounces.

That's wild. Interesting.

Yeah, it's very wild and interesting, and one of the things Simon is interested in is understanding it geometrically. Obviously Aoyagi's proof is a geometric derivation of that quantity, but from a different perspective. Maybe Aoyagi has a very clear conceptual understanding of what this bouncing is about, but I don't. So the log log n term remains a bit mysterious. But if you're not varying the depth, if you have a fixed depth, maybe it is indeed the case that the constant-order terms could be playing a significant role.

Yeah, sure. So a final question before I get into the relationship between singular learning theory and existential risk from AI: I'm most familiar with work applying singular learning theory to deep learning. Is there much work outside that, the singular learning theory of, you know, all the things people do outside my department?

Yes. Though that is where the theory has been concentrated, I would say. I don't want to give the impression that Watanabe didn't think about neural networks: indeed, the class of models based on neural networks was one of the original motivations for him developing SLT, and he's been talking about neural networks from the beginning, so early that the state-of-the-art neural networks had tanh nonlinearities. That's how long Watanabe has been talking about neural networks; he's been twenty years ahead of his time or something. Having said that, deeper neural networks with nonlinearities remain something we don't have a lot of theoretical knowledge about. There are some recent results giving upper bounds for various quantities, but in general we don't understand deep neural networks in SLT. The predominant theoretical work has been done for singular models that are not neural networks: various kinds of matrix factorization; some interesting work by Zwiernik and collaborators on graphical models, trees, deriving learning coefficients for probabilistic graphical models with certain kinds of graphs; papers on latent Dirichlet allocation, if that's the correct expansion of the acronym LDA; many, many papers, dozens. I wouldn't be able to list all the relevant models here.
But yeah, there's quite a rich literature out there from the last several decades looking at other kinds of models.

All right. At this stage I'd like to move on. My experience of singular learning theory is: I'm in this AI existential risk space for a while, people are chugging along doing their own thing, then at one Effective Altruism Global I have a meeting with this guy called Jesse Hoogland, who says, "oh, I'm interested in this weird math theory", and I tell him, "oh yeah, that's nice, follow your dreams". And then at some point in 2023 it seems like everyone's talking about singular learning theory: it's the key to everything, we're all going to do singular learning theory now, it's going to be amazing. How did that happen? What's the story whereby someone doing singular learning theory gets interested in AI alignment, or the reverse?

I guess I can't speak to the reverse so much, although I can try to channel Alexander and Jesse and Stan a little. I can give a brief run-through of my own story; I cared about SLT before I cared about alignment, so maybe I'll say briefly why I came to care about SLT. I'm an algebraic geometer by training: I spent decades thinking about derived categories in algebraic geometry, some mathematical physics of string theory and its intersection with algebraic geometry, etc. Then I spent a number of years thinking about linear logic, which might seem unrelated to that but has some geometric connections as well. Because of the influence of friends and colleagues at UCLA, where I was a postdoc, I paid attention to deep learning when it was taking off again in 2012, 2013, 2014. I'd always been a programmer and interested in computer science in various ways, and thought it was cool. Then I saw AlphaGo happen, and the original scaling laws paper from Hestness et al., and when I saw those two I thought: huh, maybe this isn't just some interesting engineering thing; maybe there's actually some deep scientific content here that I might think about seriously, rather than just spectating on an interesting development somewhere else in the intellectual world. So I cast around for ways of getting my hands, with the mathematical tools I had, on what was going on in deep learning. And that's when I opened up Watanabe's book, Algebraic Geometry and Statistical Learning Theory, which seemed designed to nerd-snipe me, because it was telling me that geometry is useful for doing statistics. When I first opened it I thought, "that can't possibly be true, this is some kind of crazy theory", and I closed the book and put it away and looked at other things, and then came back to it eventually. So that's my story of getting into SLT, from the point of view of wanting to understand universal mathematical phenomena in large-scale learning machines; that's my primary intellectual interest in the story. So I'd been chugging away at that. When I first started looking at SLT, apart from Shaowei Lin, who did his PhD in SLT in the States, with Bernd Sturmfels I believe, it was mostly Watanabe, his students, and a few collaborators, mostly in Japan and a few
people elsewhere: a very small community. So I was sitting here in Melbourne, chugging away reading this book, and had a few students. And then Alexander Gietelink Oldenziel found me and asked me what this could say about alignment, if anything. At the time I found it very difficult to see that there was anything SLT could say about alignment, I guess because, as a mathematician, the parts of the alignment literature I immediately found comprehensible were things like Vanessa Kosoy's work or Scott Garrabrant's work. Those made sense to me, but they seemed quite far from statistical learning theory, at least the parts I understood. So I think my original answer to Alexander was: no, I don't think it's useful for alignment. But I kept reading more about the alignment problem, and I was already very familiar with capabilities progress and believed there was something deep and universal going on that capabilities progress was latching onto, that it was not a contingent phenomenon resting on a sequence of very complex engineering ideas, but more like: throw scaling and other simple things at this problem and things will continue to improve. The product of that combination, believing in the capabilities progress and more deeply understanding what I was reading in the alignment literature, was me taking the problem seriously enough to think that maybe my initial answer could profit from thinking a little more extensively about it. So I did that, and outlined some ideas I had about how the stage-wise learning, the phases and phase transitions, that the Bayesian learning process in SLT talks about might, by analogy with developmental biology, be used to understand how structure develops in neural networks. I had some preliminary ideas around that in the middle of 2023, and those ideas were developed further by Alexander and Jesse Hoogland and Stan van Wingerden and various of my students and others. That's where this developmental interpretability agenda came from. And I think that's around the time you ran into SLT, if I remember correctly?

Yeah, the time I ran into it: I hear a few different people mention it, including, if people listen to the episode of this podcast with Quintin Pope, he brings it up, and it sounds interesting; some other people bring it up, and that sounds interesting; and then I hear you're running some sort of summer school, a week where you can listen to lectures on singular learning theory, and I think, oh, I could take a week off to listen to some lectures, that seems kind of interesting. This is the summer of 2023, and the lectures are still up on YouTube, so you can hear some guy asking kind of basic questions. That's me.

Yeah. I guess it took me a while to appreciate some of these things. John Wentworth has also been posting in various places about how he sees SLT relating to some of the aspects of the alignment problem that he cares about. It has taken me a while, but now I see more clearly why some of the very core problems in alignment, things like sharp left turns and so on, the way people conceptualize them, might map onto SLT when you first hear about it, in a way that makes you think it
could potentially be interesting. My initially negative take was mostly to do with there being such a big gap at that time, the middle of last year, with SLT being a very highly theoretical topic. Though I should be clear: the WBIC, the widely applicable Bayesian information criterion, which is a piece of mathematics and statistics that Watanabe developed, has been very widely used in places where the BIC is typically used. This is not an esoteric, weird mathematical object; it's a tool that statisticians use in the real world, as they say. So the work we've been doing with the local learning coefficient and SGLD and so on is by far not the only place where SLT has met applications; that's not the case, and I don't want to give that impression. But the way SLT felt to me at that time was that there were just so many questions about whether the Bayesian learning process is related to SGD training, and all these other things we were discussing. It was quite a speculative proposal to study the development process using these techniques, in the middle of last year. We've been hard at work over the last year seeing whether a lot of those things pan out, and they seem to. So I think it's much less speculative now to imagine that SLT says useful things, at least about stage-wise development in neural networks. I think it says more than that, about questions of generalization that are alignment-relevant, but a year ago it was appropriate to think there was some road to walk before it was clear that this piece of mathematics was not a nerd-snipe.

Sure. So the story involves, at some point, this guy Alexander Gietelink Oldenziel reaching out to you and saying, "hey, how is singular learning theory relevant to alignment?", and instead of deleting that email, you spent some time thinking about it. Why?

Well, I should insert a little anecdote here, which is that I think I did ignore his first email. Not because I read it and thought he was a lunatic, but just because I don't always get to every email that's sent to me. He persisted, to his credit.

So yeah: why did it feel interesting to you, or why did you end up pursuing the alignment angle?

I had read some of this literature before, in a curious but it's-not-my-department kind of way. I have quite extensively read Norbert Wiener's work; I'm a big fan of Wiener, and he wrote extensively, in God and Golem, Inc. and The Human Use of Human Beings and elsewhere, precisely about the control problem, or alignment problem, in much the same way as modern authors do. So I guess I had thought about that, and seen it as a pretty serious problem, but not a pressing one, because AI didn't work. And then I suppose I came to believe that AI was going to work in some sense, and held these two beliefs, but in different parts of my brain. It was Alexander who caused the cognitive dissonance, the resolution of which was me actually thinking more about this problem. So that's one aspect of it: just causing me to try to make my beliefs coherent. But I think that wouldn't have been sufficient without a second ingredient, and the second ingredient was this: to the degree you
assign a probability to something like AGI happening in a relatively short period of time, it has to affect your motivational system for doing long-term fundamental work like mathematics. This is a kind of personal comment. The reason I do mathematics is not based on some competitive spirit or trying to solve tricky problems or something like that. I'm very much motivated as a mathematician by the image of a collective effort of the human species to understand the world. I'm not Witten or Kontsevich or Grothendieck or somebody, but I'll put my little brick in the wall, and if I don't do it, maybe it'll be decades before somebody does this particular thing; so I'm moving that moment forward in time, and I feel that's a valid use of my energies and efforts, and I'll teach other people and train students to do that kind of thing. I felt that was a very worthwhile endeavour to spend my professional life on. But if you believe that at some time soonish there will be systems around, in ten years, twenty years, thirty years (it doesn't really matter, because mathematics is such a long-term endeavour), that will do all of that for five cents of electricity in twenty seconds, then, if that is your motivation for doing mathematics, it has to change your sense of how worthwhile it is, because it involves many trade-offs against other things you could do and other things you find important. So I actually found it quite difficult to continue doing the work I was doing, the more I thought about this and the more I believed in things like scaling laws, and in the fact that these systems do seem to understand what they're doing, and that there are interesting internal structures and something going on that we don't understand. I'd already begun shifting to studying the universal phenomena involved in learning machines from a geometric perspective, and had picked up statistics and empirical processes and all that; I'd already started to find that more motivating than the kind of mathematics I was doing before. And then it wasn't such a big jump from that to being motivated by alignment, and seeing a pathway to making use of that comparative advantage in theory and mathematics to make a contribution to that problem. So that's roughly it, along with many personal conversations with people that helped me get to that point, in particular my former Master's student Matt Farrugia-Roberts, who was, of the people in my orbit, probably the one who cared about alignment the most and whom I talked to the most about it. That's what led me to where I am now, which is that most of my research work is now motivated by applications to alignment.

Sure. So my next question is: concretely, what do you think it would look like for singular learning theory to be useful in the project of analyzing or preventing existential risk from AI?

The pathway to doing that which we're currently working on is providing rigorously founded empirical tools for understanding how structure gets into neural networks. That has similar payoffs, in many ways, to interpretability, and also potentially some of the same kinds of drawbacks. I can talk about that in more detail, but maybe it's better to sketch
out, at a very high level, the class of things that theories like SLT might say which seem related to the core problems in alignment.

Sure, and then we can talk about some detailed potential applications.

I rather like the framing that Nate Soares gave in a blog post he wrote in 2022, I think. I don't know if that's the post that introduced the term "sharp left turn", but it's where I learned about it. So let me give a framing of what Soares calls the core technical problem in alignment, which I tend to agree seems like the core problem. I'll say it in a way that I think captures what he's saying, but in my own language. If we look at the way large-scale neural networks are developing, they become more and more competent with scale, both in parameters and in data, and it seems like there's something universal about that process. What exactly that is we don't quite know, but many models seem to learn quite similar representations, and there are consistencies across scale and across different runs of the training process that seem hard to explain if there isn't something universal. So what is in common between all these different training processes? Well, the data. I guess many people are coming to the belief that structure in the data, whatever that means, is quite strongly determinant of the structures that end up in trained networks, whatever you take that to mean: circuits or whatever you like.

From that point of view, what Soares says, in his terms, is that capabilities generalize further than alignment. The way I would put it: things like RLHF or safety fine-tuning fundamentally look like training with modified data that tries to get the network to do the thing you want it to do. So take, as a broad class of approaches, "engineer the data distribution to try to arrange for the resulting network to have properties you like". If that's your approach, then you have to be rather concerned with which patterns in the data get written more deeply into the model. Soares's example is arithmetic: if you look at the world, there are many patterns that are explained by arithmetic. I don't think this is how current models learn arithmetic, but you could imagine future multimodal models just looking at many scenes in the world, learning to count, then learning rules of arithmetic, etc. There are some patterns in the world that are very deep and fundamental and explain many different samples you might see, and if, as I believe, it's a universal phenomenon that the data determines structure in the models, then patterns that are represented more deeply in the world will tend, perhaps, to get inscribed more deeply into the models. Now, that's a theoretical question; it's one of the questions you might study from a theoretical lens: is that actually the case? But the story with DLNs learning the modes of the data distribution in order of their singular values, and all that, tends to suggest this is on the right track, and I think SLT has something more general to say about it; I can come back to that later.
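The deep-linear-network result mentioned here, modes learned in order of their singular values, is easy to see in a toy simulation (an editor's sketch in the style of the Saxe et al. deep-linear-network analyses; the target map, initialization scale, and hyperparameters are all assumptions):

```python
import numpy as np

# Editor's sketch: a two-layer linear net W2 @ W1 trained toward a target
# map with singular values 4.0 and 0.5 (assumed), from a small random init.
rng = np.random.default_rng(0)
S = np.diag([4.0, 0.5])                 # strong "deep" mode vs weak "shallow" mode
W1 = 0.01 * rng.normal(size=(2, 2))     # small init => saddle-to-saddle dynamics
W2 = 0.01 * rng.normal(size=(2, 2))
lr = 1e-2

for step in range(4001):
    E = S - W2 @ W1                     # error in the end-to-end linear map
    W1 += lr * (W2.T @ E)               # gradient descent on ||S - W2 W1||^2 / 2
    W2 += lr * (E @ W1.T)
    if step % 500 == 0:
        sv = np.linalg.svd(W2 @ W1, compute_uv=False)
        print(f"step {step:4d}: learned singular values ~ {np.round(sv, 2)}")
# The strong mode is picked up first and the weak mode only much later:
# structure in the data sets the order of the developmental stages.
```

The printout shows a long plateau on the weak mode after the strong mode has been learned, which is the stage-wise picture the conversation keeps returning to.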
But broadly, I buy this general perspective: in the data there are patterns; not all patterns are equal; some are more frequent than others; some are deeper than others, in the sense that they explain more. And capabilities, whatever that means, but say reasoning and planning and the things that instrumental convergence wants to talk about models converging to, might be patterns that are very deeply represented, whereas the things you're inserting into the data distribution to get the models to do what you want, the kind of thing you're doing with RLHF for example, might not be as primary as those other patterns. Therefore the way they get written into the model in the end might be more fragile. Then, when there's a large shift in the data distribution, say from training to deployment, or however you want to think about that: how do you know which of the structures in your model, associated to which structures in the data distribution, are going to break, and which ones will not? Which ones are sacrificed by the model in order to retain performance? Well, maybe it's the ones that are shallower rather than the ones that are deeper, and on that theory, capabilities generalize further than alignment. That post is sometimes criticized for its emphasis on the evolutionary perspective on humans, the contrast between in-lifetime human behavior and what evolution was "trying" to get people to do and so on, but I think that's missing the point to some degree. This general perspective, that structure in the data determines structure in the models, that not all structure is equal, and that our alignment attempts, if they go through structuring the data, may be outcompeted by deeper structures in the data when distributions shift, is a very sensible, very grounded, quite deep perspective on the problem, which, as a mathematician, makes a lot of sense to me. I think it's a very clear identification of a fundamental problem in Bayesian statistics, even absent a concern about alignment. But it does seem to me to be quite a serious problem if you're attempting to do alignment by engineering the data distribution. So that's my mainline interest: approaching that problem. We can talk about how you might do that; obviously it's a difficult and deep problem, empirically and theoretically, and we're building up to it in various ways. But I think that is the core problem that needs to be solved.

Yeah, sure. Though if you put it like that, it's not obvious to me what it would look like for singular learning theory to address this. Maybe it suggests something about understanding patterns in data and which ones are more fundamental or not, but that's a very rough guess.

So I can lay out a story of how that might look. Obviously this is a motivating story, not one with a lot of support right now, but I can say which ingredients lead me to think the story has some content to it. We've been studying, for the last year, how the training process looks in models of various sizes, and what SLT says about that. Part of the reason for doing that is that, from an SLT perspective (and other people have independent reasons for thinking this), we think the structure of the training process, or learning process, reflects the structure
of the data: what things are in it, what's important, what's not. If it's correct that the structure of the data is somehow revealed in the structure of the learning process, that also informs the internal structures in the model that emerge, affect later structure, and are present in the final model. That starts to give you some insight into the mechanism by which structures in the data become structures in the model. If you don't have that link, you can't really do much; if you can understand how structure in the data becomes structures, say circuits or whatever, in the final model, that's already something. Then suppose you also want the relative hierarchy of importance. How would you measure that? There are several things you'd want to do to get at this question. First of all, you'd want to know what the structure in the data is; unfortunately, training networks is probably the best way to find that out. But suppose you've trained a network, which is sort of like holding a mirror up to the data, and you get a bunch of structure in that model. Well, then you're just looking at a big list of circuits. How do you tell which kinds of structure are associated to deep things in the data, which are very robust and will survive under large-scale perturbations, and which are very fragile structures, somewhat less likely to survive perturbations in the data distribution if you keep training or expose the network to further learning? Those are questions about stability of structure and how it relates to things you can measure, and these are fundamentally geometric questions from our point of view. So I think it actually is in scope for SLT, not right now, but along directions of development of the theory, to augment the invariants, like the local learning coefficient and the singular fluctuation, with other invariants you could attempt to estimate from data, which you could associate to these structures as you watch them emerge, and which measure, for example, how robust they are to certain kinds of perturbations in the data distribution. Then you'd get some idea not only of what structure is in the model, but of what is deep and what is shallow. How that pays off for alignment exactly is hard to say right now, but this seems like the kind of understanding you would need if you were to deal with this problem of generalization of capabilities outpacing alignment. If you want empirical and theoretical tools for talking about this sensibly, you'd at least have to do those things, it seems to me. So that's how I'd see it concretely. We have ideas for how to do all of those things, but it's still very early. The part we understand better is the correspondence between structure in the data and development: the stages, and the fact that those stages have geometric content, which is what the changes in the local learning coefficient say. All of that points in a direction that makes me think the story I was just telling has some content to it. But yes, that is the optimistic story of how SLT might eventually be applied to solve, or be part of the solution to, that
problem we're working towards.

Yeah, sure. If I think about what this looks like concretely, one version is the developmental interpretability style of approach: understanding whether there are phase transitions in models, at what points models really start learning one thing versus a different thing. And then I also see some work trying to think about what I would describe as inductive biases. In particular, there's this LessWrong post, I don't know if you posted it elsewhere, something about "short versus simple"?

Yes, it was a LessWrong post.

"Short versus simple": thinking about a singular learning theory perspective on learning codes of Turing machines that are generating data, and saying something beyond just the number of symbols in the code. Perhaps you want to explain that a little more for the audience?

Sure. There's been an interesting thread within the alignment literature, going back, if I'm correct, to Christiano writing about ghosts in the Solomonoff prior or something, and then Evan Hubinger wrote quite a bit about this, and others. It's motivated by the observation that if you're producing very capable systems by a dynamical process of training, and you want to prove things about the resulting process, or, maybe less ambitiously, at least understand something about the resulting process and its endpoint, then you might like to know what kinds of things that process typically produces. That's what "inductive biases" means. Well, neural networks are not Turing machines, but we have some understanding of certain kinds of distributions over Turing machine codes, and there's a kind of Occam's razor principle there which is spiritually related to the free energy formula we were discussing earlier, although not directly analogous without making some additional choices. Anyway, the story about inductive biases and their role in alignment has been going on for a number of years, and there's been, I think quite reasonably, some discussion critical of it in recent months on LessWrong; my post sort of came out of reading that. So let me briefly characterize the discussion, for context. We don't understand the inductive biases of SGD training. We know some bits and pieces, but we really don't understand systematically what that bias is. We do not understand that it's a bias towards low Kolmogorov-complexity functions; there are some papers pointing in that direction, but I don't think they conclusively establish it. So I think we are just quite in the dark about what the inductive biases of SGD training are. I read these posts from, say, Christiano and Hubinger as saying: well, here is what we know about the inductive biases of some nearby, conceptually similar thing, and if that knowledge could be used to reason about SGD training, then here would be the consequences, and these look potentially concerning from an alignment perspective. My model of both Christiano and Hubinger is that neither of them would claim those are ironclad arguments, because there's a big leap there, but it seems sufficient to motivate further research, empirically, which is what Hubinger, for example, has been doing with the sleeper agents work. So I think that's very
interesting, and I kind of buy it, but with the big caveat that there is this gap there; it isn't on solid theoretical ground. And then you can criticize that work and say it's spinning stories about how scary inductive biases are. There were some posts from Nora Belrose and Quintin Pope critiquing this: if you take the story about inductive biases uncritically, without really internalizing the fact that there is this big gap in it, then you might make overconfident claims about what the consequences of inductive biases may be. In some sense, I think both sides are correct. I think it's reasonable to look at this and think, "this might tell us something, so I'll go do empirical work to see if that's true", and I think it's also accurate that people may have become a little overly spooked by our current understanding of inductive biases.

In that context, what I wanted to do with the post was point out that, as far as our state-of-the-art knowledge of Bayesian statistics goes, which is SLT, if by "inductive bias" you mean "which parameters does the Bayesian posterior prefer", this is not description length. It's not even like description length; it's just something else, and we don't know what that is yet. So in the step Christiano and Hubinger were making, treating description length, inductive biases, and SGD training as maybe being related, I'm pointing to a particular piece of that gap where I see that this is not justified. The concern they derive from that connection may still be justified, but thinking about it roughly as description length is simply wrong. And then I gave a particular example in the post, not in a neural network setting but in a Turing-machine-oriented one. In some cases, like the simple situation we were describing at the beginning of this podcast, where you have energy levels and sums of squares, the local learning coefficient is just half the number of squares, the co-dimension, and that's somewhat like description length. If you have a system where the local learning coefficient is basically half the number of variables you need to specify, then that is description length: take your universal Turing machine with its code tape; you need n squares to specify your code, which is, roughly speaking, n variables whose values need to stay close to the specified values and not wander off, in order to execute the correct program. So there is quite a legitimate, rigorous connection between description length and the local learning coefficient in the case where you're dealing with models with this kind of near-regular behavior, where the loss function is locally a sum of squares. But typically, as soon as you perturb this universal-Turing-machine picture and introduce some stochasticity, the local learning coefficient immediately becomes more exotic, and includes, for example, a bias towards error correction, which I'd present in the following way: if you give someone some instructions, it's no good those instructions being short if they're so fragile that they can't be executed reliably. There's actually some advantage to trading off succinctness against robustness to errors in execution.
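One way to see "the LLC is not description length" concretely is through the volume-scaling characterization of λ: near a minimum, vol{K < ε} scales like ε^λ, up to log factors. The Monte Carlo sketch below is an editor's illustration (the two toy losses are chosen for simplicity, not taken from the post): it compares a regular two-parameter minimum, where λ = d/2 = 1, with a degenerate one whose λ = 1/2 despite the same parameter count, i.e. fewer "effective" parameters than variables.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, size=(2_000_000, 2))   # uniform prior on [-1, 1]^2

# Two toy losses with the same number of parameters ("description length"):
K_regular = W[:, 0] ** 2 + W[:, 1] ** 2       # regular minimum: lambda = d/2 = 1
K_singular = (W[:, 0] * W[:, 1]) ** 2         # degenerate minimum: lambda = 1/2

def fitted_exponent(K, eps_grid):
    # vol{K < eps} ~ eps^lambda (up to log factors), so the log-log slope
    # of the near-minimum volume estimates the local learning coefficient.
    vols = np.array([(K < e).mean() for e in eps_grid])
    slope, _ = np.polyfit(np.log(eps_grid), np.log(vols), 1)
    return slope

eps_grid = np.logspace(-3.5, -1.5, 8)
print("regular  lambda ~", round(fitted_exponent(K_regular, eps_grid), 2))
print("singular lambda ~", round(fitted_exponent(K_singular, eps_grid), 2))
# The regular fit lands near 1.0; the singular fit lands well below it
# (log corrections bias it somewhat under the true value of 1/2).
```

The point of the comparison: the degenerate loss occupies far more posterior volume near its minimum than the parameter count alone would suggest, which is exactly where the description-length intuition breaks.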
You don't have to get everything perfect, and you'll still more or less get what you want. There's a precise mathematical statement of that in the post, in the setting of Turing machines: it's provably the case that there will be some preference for Turing machines that are insensitive to certain kinds of errors, when they're executed in a slightly exotic way. So this setting really isn't meant to be thought of as directly analogous to what's happening in neural networks, but I think there's a higher-level conceptual insight, which, as I noticed after thinking of these ideas along with my student Will Troiani, at a meeting in Wytham organized by Alexander and Stan and Jesse, where there were some linear logic people I talked with about this: we had this idea with Will about error correction, and later I tweaked it. There is a phenomenon in neural networks, these "backup heads", where it does seem that neural networks may actually have a bias towards reliably computing important things, by making sure that if some weight is perturbed in such a way that it takes out a certain head, another head will compensate. I'm speculating now, but when I see that sort of phenomenon, it makes sense to me. It's a general principle of Bayesian statistics that short is not necessarily better: degenerate is better, and degenerate can be short, but it can also be redundant.

Right. So I guess this points to a qualitatively different way that singular learning theory could be useful. One way is understanding developmental stages, how structure gets learned over time with data; this other approach is better understanding what kinds of solutions Bayesian inference is going to prefer in these messy systems, which maybe helps inform the arguments people tend to have about what sorts of nasty solutions we should expect to get. Does that seem fair to you?

Yeah, I think so. This observation about the inductive biases has been sitting on the side, because we've been busy with other things. But my former student Matt Farrugia-Roberts, whom I mentioned earlier, and potentially others (I don't know if Garrett Baker is interested in this, but he and Matt are working on an RL project right now that may eventually develop in this direction), might take it up. You could imagine that in a system doing reinforcement learning, some of these inductive biases, if they exist in neural networks, and that's still speculation, but if the observation I'm making in the Turing machine setting about a bias towards error correction or robustness is universal, could be a pretty significant factor in things like RL agents choosing certain kinds of solutions over others because they're generally more robust to perturbations in their weights: things like making your environment safe for you to make mistakes in. That's speculation, but I do agree this is an independent direction in which you can potentially derive high-level principles from some of these mathematical ideas that would be useful.
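To make the earlier point that "degenerate can be redundant" concrete, here is a toy editor's example under the same volume-scaling reading of λ (my construction, not one from the post):

```latex
K_1(w) = (w-1)^2 \;\Rightarrow\; \operatorname{vol}\{K_1<\varepsilon\} \sim \varepsilon^{1/2},\quad \lambda = \tfrac{1}{2};
\qquad
K_2(w_1,w_2) = (w_1+w_2-1)^2 \;\Rightarrow\; \operatorname{vol}\{K_2<\varepsilon\} \sim \varepsilon^{1/2},\quad \lambda = \tfrac{1}{2}.
```

The "two-head" implementation K₂ uses an extra parameter, a longer description, but its minima form the whole line w₁ + w₂ = 1, so the posterior sees no extra complexity, and a perturbation to w₁ can be compensated by w₂. Redundancy, on this reading, can come for free.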
Fair enough. So another question I have about this interplay between singular learning theory and AI alignment and existential risk: a lot of people in the field use a simplified model in which some people are working on making AI more generally capable, and therefore more able to cause doom, and other people are working on making sure AI doesn't cause doom. When you're evaluating some piece of research, you ask to what extent it advances capabilities versus alignment, and if it advances capabilities much more than alignment, then maybe you think it's bad, or you're not very excited about it. With singular learning theory, one might argue: well, if we have this better theory of learning, it seems like it's just generally going to be useful, and maybe it's about as useful for causing doom as for preventing doom, or maybe more useful for causing doom, and therefore people on the anti-doom side should just steer clear of it. I'm wondering what you think about that kind of argument.

It's a good question, and I think it's a very difficult question to think about properly. I have talked with many people about it, not only on my own but along with Alexander and Jesse and Stan and the other folks at Timaeus; I've talked with Lucius Bushnaq about it, and some of the junior MIRI folks. So I think I've attempted to think about this pretty carefully, but I still remain very uncertain how to compute these trade-offs. Partly because of the nature of this kind of research: with empirical research, you partly get out about as much as you put in, or something; you run a certain number of experiments, you get a certain number of bits of insight. But theory sometimes doesn't work like that. You crack something, and then lots and lots of things become visible; there's a nonlinear relationship between a piece of theory and the number of experiments it explains. So my answer to this question could look extremely foolish just six months from now, if a certain direction opens up and it becomes very clear that the trade-off is not what I thought it was.

One response would be that we have prioritized directions within the theory that we think have a good trade-off of this kind, and for the things we're currently thinking about, I just don't see that the ratio of contribution-to-alignment to contribution-to-capabilities is too small to justify doing it. So we are thinking about it and taking it seriously, but I don't actually have a very systematic way of dealing with this question, I would say, even at this point. Though I think that applies to many things you might do on a technical front. I guess my model is something like, and here Alexander and I differ a little, so maybe I'll introduce Alexander's position to provide context: suppose you have the position that capabilities progress will get stuck somewhere. The main way in which people imagine it might get stuck is that there's some fundamental gap between the kind of reasoning that can easily be represented in current
models and the kind of reasoning that we do, and that you need some genuine insight, into architecture or training processes or data, whatever, to get all the way to AGI; there's some threshold there, and it stands between us and the doom. If there is such a threshold, then conceivably you get unstuck by having a better theory of how universal learning machines work and of the relationship between data and structure, and then you reverse-engineer that to design better architectures. That's pretty obviously the mainline way in which SLT could have a disproportionate negative impact. If, on the other hand, you think that basically not too much more is required, nothing deep, then capabilities are going to get there anyway, and the marginal negative contribution from doing more theoretical research seems not that important. That seems to me the major divide. In the latter world, where you see systems more or less getting to dangerous levels of capability without much deeper insight, I'm not that concerned about SLT research. One should still be careful, and maybe not prioritize certain avenues of investigation that seem disproportionately likely to contribute to capabilities, but on the whole it doesn't feel that risky to me. In the former case, where there really is a threshold that needs to be cracked with more theoretical progress, it's more mixed. I guess I would err on the side of: well, my model is something like, it would be extremely embarrassing to get to the point of facing doom and then be handed the solution sheet showing that it actually wasn't that difficult to avert; you just needed some reasonably small number of people to think hard about something for a few years. That seems pretty pathetic, and we don't know that we're not in that situation. As Soares was saying in that post, at least at that time he thought it wasn't that alignment was impossible, but rather that it was a very difficult problem that you need a lot of people thinking hard about for some period of time to solve. It seems to me we should try, and absent a very strong argument for why it's really dangerous to try, I think we should go ahead and try. But if we do hit a plateau, and it does seem like theoretical progress is likely to critically contribute to unlocking it, I think we would have to re-evaluate that trade-off.

Yeah. I guess I wonder: it seems like you care both about whether there's some sort of theoretical blocker on the capabilities side, and also about whether there's some theoretical blocker on the alignment side, right?

Yes.

So if there's one on the alignment side but not the capabilities side, then you're really interested in theory; if there's one on the capabilities side but not the alignment side, then you want to erase knowledge of linear algebra from the world or something; and if there's both, or neither, then you've got to think harder about relative rates. That would be my guess.

I think that's a nice way of putting it. I think the evidence so far is that capabilities progress requires essentially no theory, whereas alignment progress
seems so far not to have benefited tremendously from empirical work. I guess it's fair to say that the big labs are pushing hard on that and believe in it, and I don't know that they're wrong, but my suspicion is that these are two different kinds of problems. I actually see a bit of a groupthink error, in my view, in the more prosaic alignment strategy. I think a lot of people in computer science and related fields feel, maybe not consciously, that deep learning has succeeded because humans are clever and we made the thing work. Many clever people have been involved, but I don't think it worked because people were clever; I think it worked because it was, in some sense, easy. I think large-scale learning machines want to work, and if you just do some relatively sensible things... Not to undersell the contributions of all the people in deep learning, for whom I have a lot of respect, but I've worked in deep areas of mathematics and in collaboration with physicists, and the depth of theory and understanding required to unlock certain advances in those fields: we're not talking about that level of complexity, depth, and difficulty when we're talking about progress in deep learning. The view that machines just want to learn and you just have to figure out some way of getting gradients to flow, which seems similar to the Bitter Lesson essay, is a perspective I feel I see in computer scientists, in deep learning people. But I think the confidence derived from having made that work may lead to a kind of underestimation of the difficulty of the alignment problem. If you think, "look, we really cracked deep learning as a capabilities problem, and surely alignment is quite similar to that, and we're very clever and have lots of resources and really nailed this problem, therefore we will make a lot of progress on that problem": that may be true, but it doesn't seem like an inference you can make, to me. So I do incline towards thinking that alignment is actually a different kind of problem from making the thing work in the first place, which is quite similar to the view I was attributing to Soares earlier. And I think there are good, fundamental reasons, from a view of statistics or whatever, to think that might be the case; it's not just a guess. So I do believe that these are different kinds of problems, and that has a bearing on the relative importance of theory: I do think alignment may be theoretically blocked, because it's a kind of problem that you may need theoretical progress for. Now, if the empirical approaches to alignment happening in the big labs seem to really be making significant contributions to the core problems of alignment, and at the same time capabilities seem blocked, then I guess that would necessarily move me against my view on the relative value of theoretical progress, because it might not be necessary for alignment but might unblock capabilities progress.

Yeah. For what it's worth, I
get the impression that, at least for many people, optimism about prosaic alignment comes more from the idea that the key to alignment is somehow in the data and we've just got to figure out a way to tap into it, rather than from "we're all very smart and we can solve hard problems, and alignment is just as hard as making capabilities work". That's my interpretation of what people like Nora Belrose, Quintin Pope, and Matthew Barnett think; they're welcome to correct me, I might be misrepresenting them. I guess there's also the point of view of people like Yann LeCun, who think that we're not going to have things that are very agentic, so we kind of don't need to worry about it. Maybe that is a different perspective.

Sure. So, changing topics a bit: suppose someone has listened to this podcast and is interested in this research program of developing singular learning theory and making it useful for AI alignment. What are the open problems, or the open research directions, they could potentially tap into?

I'll name a few, but there is a list on the devinterp web page: if you go to devinterp.com there's an open problems page, and there's a Discord where this question gets asked fairly frequently, and you'll find some replies. I think there are several categories of things, more or less suited to people with different backgrounds. There already are, and will be, an increasing number of people coming from pure mathematics or the more theoretical ends of physics who ask this question, and I have different answers for them than for people coming from ML or computer science. Maybe I'll start at the more concrete end and move into the more abstract end.

On the concrete front: the current central tool in developmental interpretability is local learning coefficient estimation. I mentioned that the work Zach [Furman] and Edmund [Lau] did gives us some confidence in those estimates for deep linear networks. But there's a lot of expertise out there in approximate Bayesian sampling, from people in probabilistic programming to Bayesian statistics in general, and I think a lot more could be done to understand the question of why SGLD works, to the extent it works. I posed this as an open problem at a recent deep learning theory conference in Lausanne, organized by my colleague Susan Wei and Peter Bartlett at DeepMind; I think it's a good problem. The original paper that introduced SGLD has a kind of proof that it should be a good sampler, but I wouldn't say it's actually a proof of what you informally mean when you say "SGLD works". So I'd say it's actually a mystery why SGLD is accurately sampling the LLC, even in deep linear networks. Understanding that would give us some clue as to how to improve it, or how to understand what it's doing more generally, and this kind of scalable approximate Bayesian sampling will be fundamental to many other things we'll do in the future with SLT. If we want to understand more about the learned structure in neural networks, how the local geometry relates to the structure of circuits, etc., all of that will, at bottom, rely on a better and better understanding of these approximate sampling techniques.
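For readers who want to touch the central tool, here is a self-contained sketch of the logic of Langevin-based LLC estimation (an editor's illustration on synthetic population losses: the constants, the Gaussian localization term, and the toy losses are all assumptions, not the estimator exactly as used in the papers discussed):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_llc(L, grad_L, w_star, n=10_000, gamma=1.0,
                 steps=200_000, step_size=1e-4):
    """Run Langevin dynamics on the tempered, localized potential
    n*beta*L(w) + (gamma/2)*||w - w_star||^2 with beta = 1/log(n),
    then return lambda_hat = n * beta * (E[L(w)] - L(w_star)).
    L here is a synthetic population loss, so this illustrates the
    estimator's logic rather than minibatch SGLD on real data."""
    beta = 1.0 / np.log(n)
    w = w_star.astype(float).copy()
    values, burn_in = [], steps // 2
    for t in range(steps):
        drift = n * beta * grad_L(w) + gamma * (w - w_star)
        w = w - 0.5 * step_size * drift \
            + np.sqrt(step_size) * rng.normal(size=w.shape)
        if t >= burn_in:
            values.append(L(w))
    return n * beta * (np.mean(values) - L(w_star))

# Regular minimum in d=4: true lambda = d/2 = 2; the estimate lands nearby,
# up to discretization and localization bias.
L_reg = lambda w: np.sum(w ** 2)
g_reg = lambda w: 2 * w
print("regular   :", round(estimate_llc(L_reg, g_reg, np.zeros(4)), 2))

# Same dimension, but with a degenerate (w0*w1)^2 factor: true lambda drops
# to 3/2, and the estimate should come out noticeably below d/2 = 2.
L_sing = lambda w: (w[0] * w[1]) ** 2 + w[2] ** 2 + w[3] ** 2
def g_sing(w):
    return np.array([2 * w[0] * w[1] ** 2, 2 * w[1] * w[0] ** 2,
                     2 * w[2], 2 * w[3]])
print("degenerate:", round(estimate_llc(L_sing, g_sing, np.zeros(4)), 2))
```

The open problem described above is, in effect, why discretized, noisy versions of this chain still give usable λ estimates on real networks, where the potential is far less well behaved than these toys.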
So I would say there's a large class of important, fundamental questions to do with that.

A second class of questions, more empirical, is studying stage-wise development in more systems: taking the toolkit that we've now developed and applied to deep linear networks, the toy model of superposition and small transformers, and just running it on different systems. We had some MATS scholars, Cindy Wu, Garrett Baker and Shinu Chin, looking at this recently, and there's a lot more one can do in that direction.

Beyond that, maybe I'll defer to the list of open problems on the web page and talk about some more intermediate questions. There are a lot more people with ML backgrounds interested in developmental interpretability at the moment than there are people with the kind of mathematical background required to do more translation work. There are various other things in SLT, like the singular fluctuation, which we haven't been using extensively yet but are starting to use; I know there's a PhD student of CH Harry who's investigating it, and maybe a few others. This is the other principal invariant in SLT besides the learning coefficient, and it should also tell us something interesting about development and structure, but it hasn't been extensively used yet. So that's another interesting direction.

Of course, you can just take these quantities and go and use them empirically, but then there are subtleties, like the role of the inverse temperature in local learning coefficient estimation, and there are theoretical answers to questions like 'is it okay for me to do X?'. When you're doing local learning coefficient estimation, are you allowed to use a different inverse temperature? It turns out you are, but the reason has a theoretical basis (sketched after this answer), and there is a smaller set of people who can look at the theory and know that it's justified to do X. So if you have a bit more of a mathematical background, helping to lay out more foundation for knowing which things are sensible to do with these quantities is important; the singular fluctuation is one such case.

Then, ranging through to the more theoretical: at the moment it's basically Simon and myself and my PhD student Jongen who have strong backgrounds in geometry and are working on SLT; that's Simon Pepin Lehalleur, who I mentioned earlier. Currently a big problem with SLT is that it makes use of the resolution of singularities to do a lot of these integrals, but the resolution-of-singularities procedure is kind of hardcore, or something; it's a little bit hard to extract intuition from. We do have an alternative perspective on the core geometry, based on something called jet schemes, which has a much more dynamical flavour. Simon's been working on that, and Jongen as well, and me a little bit. I would say we're maybe a few months away from having a pretty good starting point for anybody with a geometric background to see ways to contribute. The jet scheme story should feed into the discussion around stability of structures under data distribution shift that I was mentioning earlier, so there are lots of interesting theoretical open problems there to do with deformation of singularities, which should have a bearing on basic questions of data distribution change in Bayesian statistics.
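On the inverse-temperature point referenced above, here is a compressed version of the kind of theoretical basis involved. This is my gloss on a Watanabe-style (WBIC) expansion, with notation assumed rather than quoted from the episode: w* a local minimum, L_n the empirical loss, phi a prior, lambda the local learning coefficient.

```latex
% Tempered local posterior and WBIC-style expansion (a sketch, not a quoted theorem):
p_\beta(w) \;\propto\; \varphi(w)\, e^{-n \beta L_n(w)},
\qquad
\mathbb{E}^{\beta}_{w}\!\left[\, n L_n(w) \,\right]
  \;=\; n L_n(w^*) \;+\; \frac{\lambda}{\beta} \;+\; O_p\!\left(\sqrt{\lambda/\beta}\right).
```

Rearranging gives lambda_hat(beta) = n*beta*(E_beta[L_n] - L_n(w*)), which recovers lambda to leading order for a suitable range of inverse temperatures; that is the sense in which switching beta is justified.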
So that's a sketch of some of the open directions. But relative to the number of things to be done, there are very few people working on this. So if you want to work on this, show up in the Discord, or DM me, or email me and ask; I will ask what your background is and then provide a more detailed answer.

Sure. Okay, at the risk of getting sucked down a bit of a rabbit hole: the singular fluctuation. I noticed that in this paper 'Quantifying degeneracy' it's one of the two things you develop an estimator for, and maybe I should just read that paper more carefully, but I don't understand what the point of it is. The local learning coefficient we're supposed to care about because it shows up in the free energy expansion, and that's all great. But what is the singular fluctuation? Why should I care about it?

Okay, I'll give two answers; the relation between them is sort of in the mathematics, and maybe not so clear. The first answer, which is I think the answer Watanabe, or rather the grey book, would give, is this. We were talking earlier about the theoretical generalization error: the KL divergence from the truth to the predictive distribution, which is a theoretical object you'll never know. So you're interested in the gap between that and something you can actually estimate, which you can call the training error. It's what Watanabe calls the training error; one should not conflate it with some other meaning of 'training error' you might have in mind, but anyway, it's some form of generalization error that can be estimated from samples. If you can understand that gap, then obviously you can understand the theoretical object. And that gap is described by a theorem in terms of the learning coefficient and the singular fluctuation: the singular fluctuation controls the gap between these theoretical and empirical quantities. That's one way of thinking about it, and that is its theoretical significance. It's much less understood, and Watanabe flags in a few different places that this is something he would be particularly interested in people studying. For example, we don't know bounds on it in the way that we might know bounds on the local learning coefficient. You can estimate it from samples in a similar way, but we don't have any results saying that estimates based on SGLD are accurate, because those would depend on knowing theoretical values, which are much less known in general than learning coefficient values.

The second answer is that the singular fluctuation tells you something about the correlation between losses for various data samples. Take a fixed parameter and a data set with n samples in it. You can look at the loss for each sample (their average is the empirical loss), so for the i-th sample you take L_i, the loss of that parameter on that sample. If you think of the parameter w as being sampled locally from the Bayesian posterior, each L_i is a random variable depending on w, and you can take the covariance matrix of those losses across all the different samples, with entries like E_w[L_i L_j] - E_w[L_i] E_w[L_j], where the losses depend on the parameter w sampled from the posterior.
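A concrete gloss on both answers, under stated assumptions: Watanabe-style results relate the expected gap between generalization and training quantities to the singular fluctuation nu (schematically, E[gap] = 2*nu/n to leading order), and nu can be read off the per-sample loss covariance just described, since up to a factor of beta/2 it is the trace of that matrix (the "functional variance"). The sketch below is my illustration, not the estimator from 'Quantifying degeneracy'; the posterior draws are stand-ins for the output of a chain like the SGLD sketch earlier, and L_i is assumed to be a per-sample negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

def singular_fluctuation_hat(per_sample_losses, beta):
    """Sketch of a singular fluctuation estimate from posterior samples.

    per_sample_losses has shape (num_draws, n): entry (k, i) is the i-th data
    point's negative log-likelihood under the k-th approximate posterior draw.
    The functional variance sums, over data points, the variance of that
    point's loss across draws; nu_hat is (beta / 2) times it. The factor is
    an assumption following Watanabe's definitions, not a quote from the
    episode.
    """
    V = per_sample_losses.var(axis=0, ddof=1).sum()
    return 0.5 * beta * V

def loss_covariance(per_sample_losses):
    """The covariance matrix from the episode, C[i, j] = E_w[L_i L_j] -
    E_w[L_i] E_w[L_j], estimated across posterior draws; the functional
    variance above is exactly its trace."""
    return np.cov(per_sample_losses, rowvar=False, ddof=1)

# Demo with stand-in draws; in practice these come from an SGLD-style chain.
num_draws, n = 400, 200
draws = rng.normal(loc=1.0, scale=0.05, size=(num_draws, n))
beta = 1.0 / np.log(n)

C = loss_covariance(draws)
print("nu_hat:      ", singular_fluctuation_hat(draws, beta))
print("trace check: ", 0.5 * beta * np.trace(C))
```

The trace check just confirms that the two descriptions agree; the more interesting uses gestured at here, influence-function-like readings of how individual samples move the posterior, live in the off-diagonal entries rather than the trace.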
That covariance matrix is related to the singular fluctuation. So you can think of it as quite closely related to things like influence functions, or how sensitive the posterior is to including or leaving out certain samples, or leverage: these kinds of notions from statistics. It's a kind of measure of how influential samples are, and it's that covariance matrix. We think this can be a tool for understanding more fine-grained structure than the local learning coefficient, correlation functions in that direction, and not only two-point functions like these but more. So, at some conceptual level, this is going in the direction of extracting more fine-grained information from the posterior than you get with the local learning coefficient.

Sure, gotcha. So, before we basically wrap up: is there any question that you wish I'd asked during this interview but that I have not yet asked?

Well, how about a question you did ask but I didn't answer, so we can circle back? You asked me at some point about how to think about the local learning coefficient for neural networks, and I told some story about a simplified setting, so maybe I'll just briefly come back to that. Given an architecture and given data, the loss function represents constraints: it represents a constraint for certain parameters to represent certain relationships between inputs and outputs, and the more constraints you impose, somehow, the closer you get to some particular kind of underlying constraint. That's what the population loss is telling you.

So what are constraints? Constraints are equations, and there are several ways of combining equations. If I tell you 'constraint f = 0' and 'constraint g = 0', then you can say 'this constraint or that constraint', and that is the equation f*g = 0, because if f*g is zero then either f is zero or g is zero. And if you say 'the constraint f = 0 and the constraint g = 0', that's kind of like taking the sum. Not quite: you have to take all linear combinations to encode the 'and', which is one of the things geometry talks about, and that would be taking the ideal generated by f and g. But basically, taking two constraints and taking their conjunction means something like taking their sum. So that gives you a vision of how you might take a very complex overall constraint, say the one exhibited by the population loss, the constraint implicit in all the structure in your data, which is a very hard set of constraints to understand. The geometry of the level sets of the population loss is those constraints: that is, more or less, the definition of what geometry is. It's telling you all the different ways in which you can vary parameters in such a way that you obey the constraints, so it's in some sense tautological that the geometry of the population loss is the study of the constraints implicit in the data. And I've just given you a mechanism for imagining how complex constraints could be expressed in terms of simpler, more atomic constraints: by expressing the population loss as, for example, a sum of positive terms, such that minimizing it means minimizing all of the separate terms. That would be one decomposition, the one that looks like an 'and'; and then, if I give you any individual one of those terms, writing it as a product gives you a way of decomposing it with 'or's.
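Here is a toy instance of that "or = product, and = system" dictionary; f and g are arbitrary illustrative choices, not anything from the episode.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)
f = x**2 + y**2 - 1   # constraint: the unit circle
g = x - y             # constraint: the diagonal line

# "f = 0 OR g = 0" is the single equation f*g = 0: its solution set is the
# union of the circle and the line. A point on just the circle satisfies it:
print((f * g).subs({x: 1, y: 0}))   # -> 0
# ... and so does a point on just the line:
print((f * g).subs({x: 2, y: 2}))   # -> 0

# "f = 0 AND g = 0" is the system of both equations (the ideal generated by
# f and g); solving the system picks out only the two intersection points:
print(sp.solve([f, g], [x, y]))
# -> [(-sqrt(2)/2, -sqrt(2)/2), (sqrt(2)/2, sqrt(2)/2)]
```

The population-loss analogue of the "and" is that a sum of nonnegative terms is minimized exactly on the intersection of the terms' zero sets, while any factorizable term cuts out a union, which is the "or".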
And this is what geometers do all day: we take complex constraints and we study how they decompose into more atomic pieces, in such a way that they can be reconstructed to express the overall original geometric constraint. So that is how geometry gets applied here: first, why the structure in the data becomes structure in the geometry; and second, why the local learning coefficient, which is a measure of the complexity of that geometry, is conceptually quite natural to think of as a measure of the complexity of the representation of the solution in a given neighbourhood of parameter space. At that point in parameter space, the loss function maybe doesn't quite know about all the constraints, because it has only managed to represent some part of the structure; but to the extent that it is representing the structure in the data, it is making the geometry complex in proportion to how much has been learned. Hence the learning coefficient, which measures that geometry, reflects how much has been learned about the data. So that's a kind of story for why this connection to geometry is not as esoteric as it seems.

All right. Well, to close up: if people are interested in following your research, how should they do that?

They can find me on Twitter at Daniel Murfet, but I think the main way to get in touch with the research and the community is to go to devinterp.com, as I mentioned earlier, and make yourself known on the Discord. Feel free to ask questions there; we're all on there, and we'll answer questions.

Cool. Another thing I want to plug there: there's this YouTube channel, I think it's called Developmental Interpretability (that's right), and it has a bunch of good talks by you and other people about this line of research into singular learning theory, as well as the lectures that I attended.

Great. Well, it's been really nice having you on; thank you for coming. Yeah, thanks Daniel.

This episode was edited by Jack Garrett, and Amber helped with transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

[Music]

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum trail (transcript): Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum trail (transcript): Med 0 · avg -5 · 133 segs

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum trail (transcript): Med 0 · avg -4 · 72 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum trail (transcript): Med -6 · avg -7 · 120 segs

Counterbalance on this topic

Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of this page's score on the same axis (lens alignment preferred). Each card compares this page's score with the pick's.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

Spectrum vs this page: This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.


Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

Spectrum vs this page: This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.


Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

Spectrum vs this page: This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
