Library / In focus

AXRP · Civilisational risk and strategy

Lee Sharkey on Attribution-based Parameter Decomposition

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core safety themes through Lee Sharkey's work on attribution-based parameter decomposition, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 134 full-transcript segments: median 0 · mean -0 · spread -105 (p10–p90 00) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.

Slice bands
134 slices · p10–p90 00

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 134 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video ZmJ-ov2TywM · stored Apr 2, 2026 · 3,320 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/lee-sharkey-on-attribution-based-parameter-decomposition.json when you have a listen-based summary.

Hello everybody. In this episode, I'll be speaking with Lee Sharkey. Lee is an interpretability researcher at Goodfire. He co-founded Apollo Research, which he recently left, and he is best known for his early work on sparse autoencoders. Links to what we're speaking about are available in the description. There's a transcript available at axrp.net. You can tell me what you think about this episode at axrp.fyi. That's axrp.fyi. And you can become a patron at patreon.com/axrpodcast. Well, let's continue to the episode. Well, Lee, welcome to AXRP. It's great to be here. So today we're going to talk about this paper, "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition". It's authored by Dan Braun, Lucius Bushnaq, Stefan Heimersheim — those three being, I guess, joint first authors — Jake Mendel, and yourself. So how would you summarize what this paper is doing? Yeah, so I would say this paper was born out of two lines of thinking: one coming primarily from what I had been thinking about, and one coming from what Lucius had been thinking about. Where I was coming from was that we'd been working with SAEs, sparse autoencoders, for some time. The community got quite excited about them, and we'd been thinking about them a lot and noticing a bunch of conceptual and ultimately practical issues with them. The line of thinking Lucius had been pursuing was a potential area of research that might form a foundation for decomposing neural networks. What this paper does is basically bring those lines of thinking together. The whole thing we're trying to achieve here is breaking up the parameters of the network instead of its activations. Fair enough. So when you say "break up the parameters of the network" — if I look at the paper, you have this APD method, and it seems like the core of it is an objective for how to decompose the network, and there are three parts to that objective. Can you walk us through what those are? Yeah. So as I mentioned, the whole goal here is to break up the parameters of a network into different components, and that's necessary context for understanding what the objective of the algorithm is. We have a neural network, and as many of these networks are, it's composed of matrices, and these matrices are the parameters of the network. Even though they are matrices, you can flatten them out and concatenate them all together into one big vector that you call your parameter vector, so your neural network lives as a vector in parameter space. And what the method does is it basically supposes that you can break up the neural network into a bunch of mechanisms, and those mechanisms sum together to the parameters of the original network.
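Written out compactly (an editor's sketch — the symbols below are not quoted from the transcript or the paper), the setup is: flatten every weight matrix of the trained network into a single parameter vector, and posit that it decomposes as a sum of component vectors living in the same space.

```latex
\theta^{*} \;=\; \big(\operatorname{vec}(W_1)^{\top},\,\dots,\,\operatorname{vec}(W_L)^{\top}\big)^{\top} \in \mathbb{R}^{N},
\qquad
\theta^{*} \;\approx\; \sum_{c=1}^{C} P_c, \qquad P_c \in \mathbb{R}^{N}.
```

Each P_c is a candidate "mechanism"; as described below, the set of P_c is randomly initialised and then optimised, rather than read directly off the network.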
And so what we want to do then is we want to start with a you know a uh a set of parameter vectors that all sum to the um well initially this you know they sum to a random uh random vector because they're randomly initialized. But we basically want to optimize these parameter components these you know the components of this sum that and we want to sum to the original network. We optimize them such that uh one um they do actually in fact sum to the parameters of the original network. two that um as few as possible of them are used on any given um forward pass. And three that they are in some sense um simple that individually they don't use very much you know computational machinery because you know one of the ways that you might uh have a have a um parameter a set of parameter vectors that sums to the parameters of the original model is just to have uh you know one parameter vector that is in fact the uh parameters of the original um uh network and they you know the as few as possible of these are used in any given forward pass because that's just um it just uses one of these parameter vectors. Um, but it's not very simple, right? You haven't really done very much work to decompose this into, you know, smaller steps that you might uh more easily be able to understand. And so you want these individual parameter components to be simple as well um as uh faithful to the original network that is sum and also minimal as few as possible of them are are necessary on any given forward pass. Gotcha. So one thing that I guess immediately struck me about this idea is like so it sort of um it sort of presented as like ah sapes have these problems and so we're going to do this thing and it strikes me as almost just like a sparse autoenccoder for the network right like what's a sparse autoenccoder well you want to like you know you have this like uh activation layer and you want to you know have something that recreates the activation layer and you wanted to like sparsely activate you there a bunch of components they should sparsely activate on any given like thing and you know if you train it with like L L1 loss or L2 loss or something like somehow you're uh somehow you're supporting simplicity as well. Um how I'm wondering like how much do you think there is to this analogy? I do think there are many parallels for sure. Um I wouldn't want to overstate them because um you know they I do feel like much more satisfied with uh the APD direction and so but you know as you as you point out there are many similarities you know um you might think of uh SAEs as um in some sense minimizing the description length of a given set of activations. you want to be able to, you know, be able to describe in as um few bits as possible, uh a you know, a given a given data set of activations in in a given layer. Um but the but yeah, the method basically um I don't know it it it focus on focuses on a slightly different object. It focuses on parameter space rather than um activation space, right? And in that in that sense kind of focuses more on um the computations rather than the results of the computations. 
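To make the three parts of the objective concrete, here is a minimal sketch of how the loss terms could fit together. This is not the authors' code: the names, the single-matrix setup, and the exact form of the simplicity penalty are illustrative assumptions.

```python
import torch

def apd_loss(W_target, components, y_topk, y_target, active, alpha=1.0, beta=1.0):
    """Sketch of the three APD loss terms for a single weight matrix.

    W_target:   weights of the original (frozen) network, shape (d_out, d_in)
    components: trainable parameter components, shape (C, d_out, d_in)
    y_topk:     output of a forward pass that uses only the active components
    y_target:   output of the original network on the same input
    active:     boolean mask of shape (C,) marking the top-k attributed components
    """
    # 1. Faithfulness: the components should sum to the original parameters.
    faithfulness = ((components.sum(dim=0) - W_target) ** 2).mean()

    # 2. Minimality: with only the top-k components switched on, the model
    #    should still reproduce the original model's output.
    minimality = ((y_topk - y_target) ** 2).mean()

    # 3. Simplicity: each active component should use little computational
    #    machinery; here the sum of singular values is a soft stand-in for rank.
    per_component = torch.stack(
        [torch.linalg.svdvals(components[c]).sum() for c in range(len(components))]
    )
    simplicity = per_component[active].sum()

    return faithfulness + alpha * minimality + beta * simplicity
```

The relative weights alpha and beta are hyperparameters, which the conversation returns to later.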
Um but there is I do you know it's not a coincidence that um you know uh we've been thinking about SAEs for some time and then you know we we come up with this uh direction it there there are some deeper similarities there and but I I think that the core similarity is that um whenever you're describing a neural network you in some sense want to use as few objects as possible because in that way you're going to be able to break it up into you know individually um you know more understandable or simpler chunks and the hope then is that you can uh you know if you understand those chunks you can understand the rest of the the network as well and so both of them rely on that principle but uh you know act on different objects and and a few other differences as well. Sure. So I think just to to get people to understand like uh what ABD is I think I think it's actually helpful to sort of go through the three parts of the objective and talk through them. So I guess I guess the first part is um you have these like this set of vectors in parameter space and they're they have to sum up together to make the whole network. Um yeah, I think the first thing that strikes um that might strike or or at least it struck me as somewhat strange is because you're looking at vectors in parameter space rather than like you know subsets of the parameters of the neural network. You're allowed to say like oh yeah this mechanism is you know three times this parameter minus two times this parameter plus half of this other parameter. um like like at first blush there there seems it seems like there's something kind of strange about that. Um and I'm wondering like is this well firstly are there like other parts of the objective that like you know mitigate against this sort of thing um or if there aren't like is this just the thing that ought to be done? I'm inclined to say that implicitly there are uh you know that I think what we will find um is that uh and we and we do you know to some extent find this in some of our experiments that even though networks try not you know we don't want our um understanding of a neural network to be in some sense um privilege some basis either in activation space or in parameter space you know that the network doesn't necessarily um have a you know a fundamental uh we don't we don't get to presume a fundamental um basis uh we have to go and find that basis either in activation space or parameter space. However, um you know, you might be familiar with the idea of just privileged bases. Um this is the idea that uh because of the um activation function serving as these nonlinearities, certain bases might be preferred. Um the and in particular bases that somewhat align with uh neurons, although you know, not equivalent to, you know, the unit basis in um you know, sorry, the the the neuron basis. Um h so it it does feel likely to be the case that um because neural networks uh have some they seem to have some tendency to um align uh with the neuron bases under you know some uh data distributions and some training objectives. Um I would guess then that uh you know APD if those if if those bases are indeed privileged in the network APD should be able to um you know recover them. Uh and thus you know implicitly kind of has a you know a bias toward if if it has a bias toward finding true things in the network and the network has uh you know some um it privileges some basis then you know it it should ultimately uh find that. 
But I I'm not sure if it does have a you know a a part of the objective that biases it toward that uh explicitly. Fair enough. So I guess the next thing I want to talk about that's uh I guess some somewhat distinct to the method is the second part of um optimizing for minimality. So what yeah concretely like how does this work? What are you actually like optimizing for here? Yeah. So we came up with a couple of different ways that you might be able to do this and we we we use one in particular that is um we we call it the top k method. So we have these um set of parameter components that we're uh training and we want them to each individually um you know take the form of one individual mechanism um of the of the network. And the way we um we we want these mechanisms to have the properties such that as few as possible of them are uh necessary on a given forward pass. And so the way we optimize for this then is that we have a we have two forward passes and one and two backward passes. So on the first forward pass we use um the uh the summed parameters sorry the sum parameter components which are you know approximately equivalent to the um parameters of the original network. And then on the first backward pass we take the uh gradients of the each of the output dimensions um with respect to the parameters. And the idea here is that we use these um these gradients as attributions of which of these parameter components was most influential over the output. Okay. And so in some sense which of these parameter components is uh most causally responsible for the output. Okay. And then on the um so we take those attributions you know each each parameter component has a you know a number that uh is some approximation of how important it was for the output. And is that just uh is that is that like is that number just the like sum of the absolute values of all the attributions or um it's a slightly more complicated um formula. We basically you take some inner product with the parameter components themselves and the gradients. Um but you can conceptually uh you know approximated uh with that it's it's roughly that idea. Sure. Um and so you basically uh you have this number that you know tells you how rough approximately how important this uh parameter component was for the output and then you say well I'm going to only take the top k most important parameter components and then you do a second forward pass only using the top k most important parameter components and what this should do whenever you train the output of that second forward pass to be the same as the original model is that it should um update these active parameter components such that they become more important on this forward pass for that data point. Um so uh they basically should increase their attributions um on this data point uh compared with um before the before the the gradient update and the gradient update is just the the second um backward pass. So yeah, there's the Yeah, that's basically what the the four steps of the um of that training step do. Gotcha. So, so I guess my main question is it seems like the fundamental thing here is like minimizing the number not the like number of mechanisms total but the number of mechanisms that are like relevant for any one um for any like single uh you know forward pass of the network. Um, I I think when I first came across this idea, it just like wasn't at all intuitive to me why that should be the the minimality that's necessary rather than just like minimizing the total number. Um, yeah. 
What what's going on there? Yeah. Um, so I'm just trying to understand um what the the confusion is. I I think the the way maybe to think about it is that uh if I wanted to minimize just the number of uh parameter components that I used on any given forward pass, you know what one um one thing I might do is just uh as as we were discussing earlier, you know, we may just use the parameters of the original network. Um, of course, this doesn't really um this isn't satisfactory because uh it doesn't break up the the parameter components into you know something that is simpler than the original network. And so one thing that you might um you know so already we don't get to just um you know minimize uh the number of parameter components that are active on a given forward pass. Um so you might then you know imagine that there is like a predefiner components I've split up the network into um versus how simple they are. Yeah. And there's going to be some you know uh for a given level of simplicity um I'm going to require a you know a certain number of um parameter components on a given forward pass. Um but yeah, you don't really get to um I guess I don't know. Yeah, maybe maybe you can uh spell the the question. B bas basically my question is um so in actual ABD right you're one of the things you're minim optimizing for is that on any given forward pass there should be few components active but like on different forward paths like maybe you have like maybe on on sorry passes maybe on this forward pass you have com you know mechanisms like 1 three and five active on this forward pass you have mechanisms two four and six active and then you're like ah this this is like uh pretty good you know um but you can imagine a world where you say like hey I just want there to be as little few mechanisms as possible just for all the inputs right so so so in in this hypothetical network where like um you know you have like 1 three and five on this input two four and six on this input for ABD you're saying like oh yeah that's like it's only using three mechanisms for any forward pass but you could have a hypothetical method that's saying ah that's like six mechanisms that are being used in total and I want to minimize that number. So why is it like the per forward pass number that we want to minimize? Yeah, I I think it is in fact the other one that you want to to minimize. You you do want to minimize the total number um because we're you know we're ultimately um averaging the gradient steps over uh batches such that um it will on average point toward um a configuration such that the uh that if you get to share a parameter components between these different data points. you know, if if you have a data point that has, you know, 1, three, and five, and another one that has um one, four, and six, um this one should be favored over the uh the you know, the one where you just get to split up, you know, one into two different mechanisms that that um that are that is active on on both of these data points. Um I guess what I'm saying is that uh you arrive at um you arrive at a case sorry you you you basically do want to optimize for uh cases where things are shared and thus kind of um where there is few me as few mechanisms as possible over the the entire data set. Um you just happen to be doing this you know uh batchwise over individual data points. 
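A rough sketch of the training step described a little earlier (forward pass with the summed components, attribution, top-k selection, then a second forward pass and a backward pass on the sparse reconstruction). Illustrative only: a single bias-free linear layer, with the gradient attributions written in closed form for that special case rather than via an explicit backward pass.

```python
import torch

def apd_training_step(x, y_target, components, k):
    """One APD-style training step (sketch, not the authors' implementation).

    x:          inputs, shape (batch, d_in)
    y_target:   outputs of the original (frozen) network, shape (batch, d_out)
    components: trainable parameter components with requires_grad=True,
                shape (C, d_out, d_in)
    k:          number of components kept per example
    """
    # Forward pass 1: run the model with the summed components (≈ original weights).
    W_sum = components.sum(dim=0)
    y_sum = x @ W_sum.T  # in the general case, backward pass 1 would start here

    # Attribution: for a linear layer, d y_o / d W[o, i] = x_i, so the
    # "gradient · component" inner product for component c and output o is
    # (P_c x)_o; squaring and summing over outputs gives one score per component.
    attributions = torch.einsum('coi,bi->bco', components, x).pow(2).sum(-1)  # (batch, C)

    # Keep only the k most attributed components for each example.
    idx = attributions.topk(k, dim=1).indices
    mask = torch.zeros_like(attributions).scatter_(1, idx, 1.0)               # (batch, C)

    # Forward pass 2: rerun the model using only the active components.
    W_active = torch.einsum('bc,coi->boi', mask, components)
    y_topk = torch.einsum('boi,bi->bo', W_active, x)

    # Backward pass 2: train the active components to reproduce the original
    # output (the faithfulness and simplicity terms would be added here too).
    loss = ((y_topk - y_target) ** 2).mean()
    loss.backward()
    return loss  # an optimiser step on `components` would follow
```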
My understanding of the method is so you have you have this like batch of inputs right and what you do is in in a batched way um you like for each input you take the top k you don't really you do batch top k but like that's an implementation detail right y so for each of these inputs you take the top k um mechanisms that are being used and then you upweight and and then you like do a backward pass where you're optimizing for on each input the top k things that were used for that input are um basically optimized to you know better reconstruct you know the the the output of the network on that particular input sure and so I don't see mechanistically like you know it if I have the like 135 246 Mhm. Right. I don't see like how that's going to be optimized for like no actually you should like use the same uh few things because like like you're just taking the top K for both of them right you're not touching you're not like like I don't see where the gradient term would be for things to share mechanisms like in the in the 1352 446 case I think you just like up 135 on the first input and up 246 on the second input right? Yeah, I guess maybe it might be useful to think of a you know a concrete example. So maybe the the the toy model of superposition model might be a useful um example here. So suppose we so you know the toy model of super superp position um is uh a a model developed um at anthropic uh it's a very simple model where there are sparsely activating um input features and these you just a few of these are active on any given um input and the the model um is consists of it's it's also just a very simple model. It's got a a weight matrix and that weight matrix you know has as many um suppose it has as many uh rows as the number of in sparsely active input features and it has as many um it has this basically smaller number of columns. So it's basically like a dime projection matrix from data space into this hidden space and so and then you know this this down projection gets up projected by uh this matrix whenever you um you know like you could use in theory a different matrix but you can just transpose a matrix and uh that um spits out a you know a uh an output um after you uh pass it through a a ReLU um activation and basically want to reconstruct the the input data and there's a bias in there just uh so what um suppose then you have some uh input datom which has you know the first and the uh third and the fifth um features active now these are active you know the these are in fact um you know active inputs uh but in the um in APD you don't necessarily, you know, get to uh you haven't yet learned um whether or not there is a your parameter components haven't yet learned um to correspond to these um input features. M. So you might then um have a bunch of uh basically yeah I guess maybe one thing then to to think about is well suppose you have a bunch of different parameter components. Um they're because they're not aligned with the the features in some sense they uh there's too many of these parameter components active. Um and you want to uh you you know APD is designed to learn as many sorry only as many parameter components as are necessary to explain the um whatever the network is doing uh for you know on on this data distribution and you basically don't want it to learn more parameter components than are necessary. Uh and this is you know what you achieve by both optimizing for um minimality such that as few as possible are necessary and that they are as simple as possible. 
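For reference, a minimal sketch of the toy model of superposition Lee describes — sparse features down-projected into a narrower hidden space and reconstructed with the transposed matrix, a bias, and a ReLU. Sizes and initialisation here are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class ToyModelOfSuperposition(nn.Module):
    """Toy autoencoder: more sparse input features than hidden dimensions."""

    def __init__(self, n_features=5, d_hidden=2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):                          # x: (batch, n_features), sparse
        h = x @ self.W                             # down-projection into the bottleneck
        return torch.relu(h @ self.W.T + self.b)   # up-projection, bias, ReLU
```

The model is trained to reconstruct its sparse inputs; the ReLU is what lets it filter out the interference that the bottleneck forces between non-orthogonal features.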
And so suppose you have two parameter components where um even though the ground truth um feature one is act active you have um two of your parameter components are in fact act active um one is you know maybe slightly more active than the other. Um but you you don't want this because like ultimately you want to uh learn only one of these parameter components per um input feature. And the idea then is that um in some of these cases where this input feature is active um and one of these you know parameter components uh is more active than the other because it would be you know statistically unlikely that they are are equally active. Um there will be cases where because you're thresholding this um the one that does uh get active will get updated such that it um in future cases like this where the feature one is active um it gets more attribution versus uh cases where it where it doesn't. Now, I'm not sure if this fully addresses um your concern, but I think maybe one of the um I guess I'm pointing at if there is a ground truth where um one thing is active and two things in two parameter components are uh then active, then this is something that we uh do in fact get to avoid by um minimizing uh both for for minimality also for simplicity. Right. So I think the the the image I am getting from your answer which is which which might be totally wrong. So tell me if it's wrong. But I I think the thing you're saying is okay probably for every input there are in fact like some parts of the network that are like you know more active on that input and and like there there I think you're almost saying like imagine there is some ground truth decomposition that's like not super big right? Well, if I have like input A and input B, right? And they in fact do use many of the same mechanisms, um, then it's going to be then basically APD is going to be disincentivized from the like 135 246 solution just because like you know it it's going to be, you know, you're picking like few mechanisms active on any given thing, but you're trying to make it like mimic what the network actually does. And so if the thing the network is actually doing is uh you know using some of the same like actual parts of the actual network then you're going to push these these like 2 4 6 you know to be close to the actual mechanisms of the actual network and you're pushing one two and three to be close to the actual mechanisms of the actual network. So like they're they're just going to merge basically. Yeah. Is that roughly right? Yeah I think so. Yeah. Okay. So, so in that case like yeah the so so it seems like basically for this story to work you basically you're basically saying no there is some ground truth decomposition and like because we're you know we're doing this thing that's getting close to the ground truth decomposition that that's like what's powering our thing working as as opposed to some sort of construct constructivist thing of like ah here's just the nicest way we can find of like decomposing things. Yeah, this is a this is a question I haven't quite made up my mind on yet. Um I I think you know in toy models um it is it can be the case that you have a ground truth decomposition because you made it that way. 
Um and you know the way that you um might have uh designed this is that you know I if someone came to you and um told you well you know I've got a equivalent way to describe this network that you designed yourself um and their description you know uses uh more you know it uses either more components than is necessary or it uses more complex components than is necessary then you might say Well, sure, kind of. But like I think this other explanation, the one I used in my head to design this network is is better. Um, and in some sense then, you know, it is more toward this like constructivist um way of thinking. Maybe then, you know, there is actually no such thing as a ground truth explanation for the network even though you designed it and even though you said this is the ground truth explanation. Um if there are other equivalent things where um more objects more complexity was necessary um then sure there's still explanations but they're not as good. And in the case of more natural networks maybe it is also the case that even though um we can debate whether or not there is some ground truth to the thing that the network is doing. the style of explanation that we you know most prefer is something that is um you know the shortest simplest uh explanation for for what the the network is is doing. So um so I think but before we we go further into the philosophy of APD I think I want to just like get through the the parts so that people fully understand. Um so the third kind of component of this objective function is simplicity. you're optimizing like each component to be simple. Um can you tell us like what's simple? Yeah. So the way we defined simple, you know, simple is supposed to capture this uh this intuitive notion that um it uses as little computational machinery as possible. And what you know what does it mean for a uh a set of matrices to to use as little computational um machinery as possible. Um the definition that we settled on was um that if the uh network consists of you know one matrix that matrix is as low rank as possible. um you can't get much simpler than a uh you know a a rank one matrix and a rank two matrix is is is less simple and it does more things to you know a a given um input vector. Um and if your you know network consists of more um more matrices um than just one uh you basically um are get penalized for you know uh ranks in in in those matrices as well. So it's basically the the thing that we want to minimize is the sum of the ranks over all of the um of the matrices um in a network. Now the I don't know we are kind of we're not fully um happy with this but we do think that this is like a fairly reasonable uh reasonable notion of you know what it means to to use as as little computational um machinery as possible. Yeah. So in some way so if I think about what that's saying right like saying that the so so so there is something intuitive there right like for instance if you use fewer matrices that should be count as more simple like lower rank is basically saying like your matrix is secretly over like smaller dimensional input/simler dimensional output space. Yeah. 
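Stated compactly (editor's notation, not the paper's), the simplicity measure being described for a component P_c split across layers l = 1..L is roughly

```latex
\mathrm{simplicity}(P_c) \;=\; \sum_{l=1}^{L} \operatorname{rank}\big(P_{c,l}\big),
```

summed over components in the loss. Since rank is not differentiable, training uses a continuous surrogate, which comes up again later in the conversation.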
Um I think in some ways what it's saying is that like the it it's in some ways being basis independent in this kind of interesting sense right like um you're sort of saying that like you know the identity function versus like a weird like rotation and scaling like as long as you're doing it on the same number of dimensions like it's uh those counts as the same wi-i- which I think is actually plausible given that like you know different layers activations functions are like in some sense uh you know maybe they just should be incomparable in that way maybe you don't want to equate like these neurons with these neurons I think like maybe the other thing that seems slightly strange about that is like by being basis independent like by saying that you know the complexity of this this weight matrix is just like the the rank of it suppose you have like you know two two components in one layer right um by saying complexity of both of them is the rank. Somehow you're saying that like the the basis you're thinking about for the computation of like thing A and the basis you're thinking about for the computation for thing B are just like not related at all. Um and maybe there's something there that's worth I I don't exactly know what the objection there would be, but it it seems like there's possibly something there that's worth worth getting into. Yeah, I mean I I think this is um I think that's just something that we're willing to to accept. We do want to basically um in some sense what the the exercise we're uh trying to do here is basically discretise the network into you know discrete objects. Um and ideally we want to discretize it into you know objects that have as little to do with each other as possible. Um and you know if it is the case then that uh you know we can in fact um you know just distinguish between one kind of operation and another um you know sometimes that operation is used and on other data points it is not um then I think we're kind of we're okay with that. Um but you know one of the the reasons that APD was developed was the case of you know multi-dimensional features. Um and the idea of a you know multi-dimensional feature is that well maybe you don't get to just break things up into rank one components. Maybe you actually do in fact you know need um need more than one. So the the classic example here is um the days of the week features where uh the days of the week lie on you know a a points and a circle. Um and and and crucially they're in the right order. Right. There's Monday then Tuesday then Wednesday. Yeah. Exactly. Um and you know in order to describe these features um sure you can describe them as seven different um directions in activation space but you can more um more succinctly describe them as uh you know two basically two dimensional um objects basically uh and you know if you want to I don't know understand the Yeah, I guess I don't know understand the operations that are done on those. Yeah, it might just be useful to think of them um as as two dimensions rather than um you know seven dimensed one-dimensional objects. Um like the idea is that we want ABD to be able to uh decompose networks into chunks that if they do have these um computational units that are best thought of as two dimensional rather than one-dimensional that it can indeed uh find those um and isn't just you know uh decomposing things into into many objects. Fair enough. 
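A small illustration of the "days of the week" point (purely illustrative numbers): seven directions in activation space generated by one two-dimensional circular feature, so a rank-2 description rather than seven separate one-dimensional ones.

```python
import numpy as np

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
angles = 2 * np.pi * np.arange(7) / 7
circle_points = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (7, 2), in order

# Embedding the circle into a d-dimensional activation space uses only a
# rank-2 map, so the representation needs two dimensions, not seven.
d = 16
basis = np.linalg.qr(np.random.randn(d, 2))[0]                        # (d, 2) orthonormal
day_activations = circle_points @ basis.T                             # (7, d), rank 2
```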
So I guess I next want to just talk about like so so I want to like test my intuitions for like okay it basically do these do these objective functions make sense when I compare against you know certain examples. So the first example I want to ask about is suppose I have a de composition of the network that's just like each component is one layer of the network like component one is the first layer component two is the second layer component three is the third layer um is that going to I feel like that might score well on APD as long as like you're allowed like that many components right and so my reason for thinking this is basically you have the cost that each matrix is full rank but unless there are like I guess it's possible that there are like unused um unused dimensions in the network that you could prune off right like if there are some like sometimes rel uh in re networks some neurons will die so so yeah suppose you're like uh taking the weight matrices but you're like pruning off the dead relies right it seems like it seems like that might actually be optimal as long as you're allowed that many components just because like it's a perfect recreation of the network and No other way of like mixing around the you know the the mat the things is going to save on like the total rank because you just need that much rank total. Is is that right? Um it depends on the data distribution. Um so I there is a case where where it is right but it's a fairly strange case. Um okay suppose you have a data distribution where for every input um all of your rel layers are always active. Um, so fine, you've printed off the dead ones. Um, those, you know, those never activate, but on the other ones that do activate, they're always active. Um, so everything's always above threshold. And so what you've really got here is just like three linear transformations. Um and in that case uh you yeah you don't really get to um summarize that any more than just you know describing the the three linear transformations because on every given input in our data distribution um there's always like some nonzero amount um that each rank is is used. Fine. there's going to be some, you know, infatimally small number of cases where it's perfectly orthogonal to some of the um, you know, some of the some of the smallest uh, singular dimensions of some of these matrices. Um, where and and in that very small number of cases uh, you know that a parameter component that aligns with um, sorry, an activation uh that aligns with parameter components that sorry align with that dimension they you know the attribution will be zero. But um in almost every case all of the ranks will be used. Now you can imagine for certain um other data distributions um well I guess maybe one way to think about it is that that wouldn't be a very useful neural network because it's just doing the same transformation to every input um and uh you know that you might as well just use one um linear transformation. The interesting thing about um neural networks is that they can uh they can do different transformations to different inputs. Um and in that case then um you know in some inputs um you may use transformations that go one way and on other inputs you may use transformations that go another way. Um and in that's the kind of thing that you want to be able to to break up uh using APD. Right. So, so, so, sorry, if I think through this example, it seems like, so suppose you have these alternate me set of mechanisms, right? 
This alternate decomposition where like on on this input, you know, we're only using like, you know, this half of the neurons and on this input, we're only using like this half of the neurons. Yeah. Um, at first, sorry, maybe I'm telling about this wrong. It seems like this is actually a case where minimality per input is actually buying you something because I in my imagination you're still using like the same amount of rank and like maybe you still have the same total number of things but the thing you're saving on is like in the per layer thing like like every layer is active on every input, right? But if you can break it up with like oh you know this is only using a subset of neurons so I only need this subset of the mechanisms like is it it seems like maybe the thing I'm saving on there is like uh you know maybe it's rank and maybe it's number of neurons but like sorry number of components but like on a per input basis rather than like over all the inputs. Yeah I think that's right. So you know you might suppose you have um two components in each of these layers um and you got three layers and so you've got you know six components um overall well if your if your parameter components um suppose your your uh your data distribution is you know uh split up such that you can in fact um you know throw away half the network uh that is involved in you know one half of the data distribution and you can for the other half of the data distribution, throw away the the other half of the network. Um so you can basically just treat these as two separate networks that happen to be you know um mushed into one and um if you know so we've got these we got these six um parameter components and if they're you know lined up such that um you know one of these three of these parameter components are they correspond to one of these data distribution one of these you know data distributions and the other three corresponds to the uh the other data distribution then yes that you know on some inputs you'll will be able to use only three and on others um well yeah in all cases you'll be able to um you know use uh just three but in you know if your parameter components don't line up perfectly with these distributions you'll have to use six every time um which is yeah just not something that you want to do if you want to you know um break it up sorry decompose it into you know a smaller number of active objects at any given point. Okay so so I I think I feel satisfied with that case. I next want to just talk about um so this is this is like a little bit out there but to help me understand um I think it would be helpful for me to talk about doing APD to a car right um so so basically because like a car is an instance where I feel like I understand what them well okay I'm I'm not actually that good with cars but I but I have a vague sense of like what they're like and I and I think I have a vague sense of what the mechanisms in cars are right so if I imagine like taking a car and doing APD to it, right? I want some decomposition of all the like stuff in the car that firstly uh all the stuff in all the decompositions just like reconstitutes the whole car. I'm not like leaving out a bit of the car. That makes sense to me. Yeah. Secondly, um I want there to be as few parts to my decomposition that are like relevant on any given like car situation. So, so like you know there's some situation like maybe maybe suppose we discretize time, right? 
And there's like some input to me driving and then I do a thing and then the car, you know, uh maybe it has to be a self-driving car for this to fully make sense. Sure. Um and then the third thing is that each component uh the the components have to be as simple as possible. Right. One concern I have is I think when people are driving a car usually like there are a bunch of components that are active at the same time. Yeah. That are basically always active at the same time even though I think of them as different components. So one example I think is if you're like the the steering wheel, right? There's always like an angle at which the steering wheel is going and whenever the car is on like that that angle matters to how the car is going. Sure. There's also a speedometer which like tells you how fast you're going, right? And that speedometer is always active whenever the thing is on. Yeah. Now, if I imagine like would APD tell me that the steering wheel and the speedometer are the same or part of the same component? I worry that it would because there's no I I I think there's no like complexity hit from describing like like if I describe them separately that I have the complexity of the speedome speedometer plus the complexity of the steering wheel. Yeah. Um you know these two things and if I describe them jointly as a speedome speedometer and the steering wheel then I've got to describe the speedometer and I've got to describe the steering wheel like same amount of comp complexity. Yeah. Um, but in the case where I merge them, I have like one component instead of two. And like there's never like some there's never like some cases where the steering wheel is active but the speedometer is not active or vice versa. Uh, if I understand cars correctly, maybe maybe people have a counter example. Um, so in this case like yeah would APD tell me that the spinometer and the steering wheel are the same and are part of the same mechanism? And if so, is this a problem? Um I think it there's a kind of like I don't know functionalist stance that we're taking here that like you know we want to understand um a particular function of the car and we're you know I think it might help to to specify what that um what that function is. So um suppose uh suppose that function is just like can I get um me a human from A to B um and so suppose you know I live in a country that doesn't require um spinometers um and I you know don't really care what uh what my speed is um and it really just doesn't affect uh you know my behavior and therefore it doesn't affect you know the the um behavior of the you know the car um in this case uh you know the we we can basically ablate the speedometer and the car would go from A to B with very little changed. Um now in a different context uh you know we whether or not this a speedometer might um affect uh you know the the the decomposition that we think is like the most succinct description of you know the the the thing that is doing the driving from A to B. Um another analogy here um might be uh a more general um case might be well we have the engine and we have the brakes. Now whenever I'm moving um the the brakes are not always on and so whenever I don't need the brakes whenever I'm not braking um I can basically ablate the brakes and my behavior of the the car the behavior of you know the the the Lee and car system is basically going to be unchanged. 
Now of course if I you know if I ablate the brakes and then do want them um there is a you know a difference between those two worlds where I do have the brakes and I and I don't um there's like uh some sense in which um breaking it up in into thing that makes the car go forward and thing that makes the car go you know stop um is actually a useful decomposition. Um so bringing it back to your example I I do think that um it it matters like the the kind of function that we are specifying here. Um, and in the case that you mentioned, um, it might not matter whether or not you, uh, you know, decompose the car into the engine and the speedometer because it's all one part of, uh, in your example, there was, you know, uh, no driver. Um, and it's all part of, you know, one causal process. Uh, the, you know, the speedometer is just basically intrinsically attached to the to the engine. And we can, you know, we therefore don't really get to uh chunk this the system up into um you know, two different objects. But because the the stance what we're like describing as the function here matters um that kind of depend like determines whether or not you can uh you can in you know in one stance ablated and in another stance not um sorry in one sense you know decompose them and in another stance um not right. So I think yeah so so I mean maybe one way of saying this is that like part of the like like how do you tell that the speedometer and the steering wheel are different? Well, one way you can do it is you can have like sort of test cases, right? Where like uh you know uh you have this guy who like doesn't really care about how fast he's going, which which is still a little bit weird, right? Because like at least back when I was driving, like that was relevant to like how you know can you turn. Um sure, but I maybe you can just figure that out by looking at the road and you know being smart, right? Um, but like at the very least you can go to a mechanic and you can like get your car in in some sort of like test situation where like you know you're you're just like checking if the speedometer is accurate by like hooking it up to some like car treadmill thing and like sure you know the the steering wheel doesn't matter there maybe um or or vice versa. Um, so, so one way I could think about this is like this shows kind of the importance of like a diversity of like inputs for ABD that you you've really got to look at the whole the whole sort of relevant input space and if you don't look at the whole relevant input space you might inappropriately merge some mechanisms that like you could have distinguished does yeah that is that maybe a takeway um yeah that that feels right um it does feel that it feels right that you know in order understand all the functions we do need to um you know look at all the kind of cases sorry in order to decompose networks into um all the distinct mechanisms we do need to look at all the cases where those mechanisms um may be distinguishable um yeah that that feels that feels um like a reasonable takeaway yeah sure I guess the next thing actually the other thing about the car that I thought about when you were talking about it is seems relevant for just like identifying which mechanisms are active. 
So in the in the paper the test for whether a mechanism is active is this like gradient based attribution right which is basically like if you changed this bit of the network would that result in a different output right now suppose I'm driving and I'm not using the brakes if you change the brakes such that they were like always on then that would change my driving behavior right or or like like even in an incremental way right like if if changed the brake pedal such that it was always like a little bit pressed, that would be slowing me down. Yeah. Um, so am I right to think that uh if you use and maybe we're just like straining the limits of analogy or whatever, but am I right to think that if we used the equivalent of like gradient based attribution to decomposing a car, you would be thinking that the brakes were like always an active uh mechanism. I think it it may be the like running up I against the limits of the analogy maybe. Like one of the things that the gradient um based attribution is supposed to approximate is um if you were to um you know what gradients are actually measuring is like if you you know twiddle the the um activations or the parameters in one direction um will it affect uh you know the the thing with which you're taking the gradient um uh off. So um and the you know the I don't know this is supposed to approximate basically like how ablatable is this direction you're basically saying if I moved um if I didn't just do a small twiddle but like did a very large twiddle to you know from where I currently am to to zero um then uh should it affect the um you know the the thing that I'm taking the gradient of uh you're basically taking a first order approximation of um the effect of ablating. Yeah, that's just like you know what you're what you're uh trying to do whenever you're taking the gradient tier. Y um and yeah, so if you so maybe ablatability is like a you know a a way to to you know port this into the the analogy. Um hence like if you can ablate the brakes uh you can um you know uh and nothing changes in that situation then uh you know the the brakes are in some sense um for this you know for this moment the brakes are degenerate. the brakes just are not needed. Uh for for this particular um you know data point where a data point where uh you know I did not need to break but on data points where I was breaking um I do not get to ablate the brakes and have that uh you know the the state does change quite a lot um whether I uh uplate the brakes or not in cases where I am in fact requiring the brakes. Fair enough. So I guess the last question that I want to ask just to help me understand like ABD is if if I recall correctly in the either the abstract or the introduction of the paper there's this disclaimer that like okay you know parts of this are just like you know implementation details and like there there's kind of a core there's a core idea and there's like how you made it work for this paper and like those are not like quite the same thing. Yeah. out of the stuff that we talked about like yeah which parts to you feel like kind of the core important parts of like you know the version of APD that you you're interested in investigating and which parts of it just felt like okay this is like the first way I thought of to operationalize this thing. Um certainly using gradient gradient based attributions is not something that we're wed to at all. 
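The attribution idea in this exchange can be written as a first-order Taylor approximation of ablation (editor's paraphrase; the paper's exact formula may differ in details):

```latex
f_o(x;\theta) - f_o(x;\theta - P_c) \;\approx\; \nabla_{\theta} f_o(x;\theta)\cdot P_c,
\qquad
A_c(x) \;\approx\; \sum_{o}\Big(\nabla_{\theta} f_o(x;\theta)\cdot P_c\Big)^{2},
```

where f_o is the o-th output of the network and A_c(x) is the attribution of component c on input x. The brakes example is exactly the regime where the linearisation is unreliable: a small twiddle changes little, but ablating all the way to zero changes a lot.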
um what they're supposed to um do as I mentioned is just figure you know it's it's supposed to get some measure of how causally important these um you know a given parameter component is now it's not the only um potential method that you might um consider using uh you know people have um we you could you should be able to sub in any um you know method of uh of causal attribution um there and and replace that uh this this is something you know that we're uh keen to replace basically because gradient based attributions will have you know all sorts of predictable pathologies such as um well you know I mentioned that it's a first order approximation of um causal ablation but it is really just it's a first order approximation it's not going to be very good um you there will be cases where um you know if you uh if you twiddle the the the parameters in a certain direction um and the output doesn't change very much. Uh but in fact it is if you blade it you know the entire way it does change um it does change a lot. A classic example of this is um attention where if you're you know if you're um really paying a lot of attention to a particular sequence position um you know your attention um softmax is basically uh saturated um in that you know on that uh on that um sequence position and even if you you know change the parameters a fair bit it may you know locally it may not change very much but if you change them um a lot you may go from a regime where you're saturated to non-saturated And uh you know then you realize ah in fact this was uh you know a a causally important um causally important uh you know sequence position. Um and so there's just lots of uh predictable pathologies that will arise out of gradientbased attributions. Um we're also like not totally wedded to uh the definition of simplicity that we have. um we're open to you know other potential definitions that um may be more principled. For instance um you know one of the the main motivations uh in the you know design process of this um method was not to be basis privileged. Um and there are you know there's a bunch of reasons for this but um one of the uh one of the um one of the reasons is that well representations or computations in neural networks seem to be distributed over a very large number of uh different different things. The classic case is that you know you don't get to just look at an individual neuron and understand um uh you know an an individual function within the the network by looking at one neuron. have to, you know, at very least look at multiple um multiple neurons. Things seem to be distributed over multiple neurons. But it gets even, you know, worse than that. It's uh distri this representations may be distributed across uh you know multiple layers. In fact, in especially in in residual uh networks where um you don't really get to just look at one layer to understand something. You have to look at multiple. And the same thing goes for for attention heads. Maybe uh you know maybe in fact um you know a lot of um analysis looks at individual attention heads but this is kind of an assumption. 
We're kind of assuming that um the network has chunked it up such that one head does one thing and right there's some like intuitive reasons to believe that but there are some intuitive reasons to believe that one neuron does one thing and there's there's no fundamental reason why it can't distribute things across um you know uh attention heads and there's some uh you know toy examples and some um empirical evidence that this may be happening in uh in networks and so this is you know there's a bunch of reasons why you might not want to be um basis privileged and uh the thing that um our simplicity um measure it. It does in fact um you know uh privilege layers um because it's the sum over layers. Uh it doesn't you know privilege particular ranks but it does priv privilege layers and um you know we're we're open to to versions of this um metric that uh don't um don't privilege privilege layers. Um, aside from that, um, yeah, I think I think what we're, um, the the fundamental thing about this this whole method is that, uh, we get to, um, decompose parameters into directions in parameter space. And we're um, you know, open to different ways to to doing this. Um it's it's more we we hope this is just like a a first pass of a general class of methods um that do parameter decomposition and the kind that we're you know introducing um to some extent here is like linear parameter decomposition. We're decomposing it into something that sums to the the the parameters of a of the original network. And we think that's likely to be you know a a somewhat um powerful way to decompose networks. not necessarily the only one. Um but we yeah we hope this points toward you know a a broader class of of networks that of of which APD is just one. Sure. Um okay it turns out I lied. I have another question just about how you actually how the method actually works which is so I don't know I guess obviously there are a few hyperparameters in APD training but one that feels very salient to me is like how many like how many uh components actually get to be active on any given thing. Um yeah like how so first of all how how in fact do you pick that? It is one of the things that we want to uh move away from um in future uh future versions of the model. Um I mentioned that we were using you know an implementation that is like a top k implementation um where you're just choosing you know a certain value of k and you're saying this is the um you know the the number that is active on each data point. In fact, we used batch topk where you're saying um you you get a little bit more flexibility per data point, but you still have to say over you know a batch of um you know a given size uh we still want on average there to be only k active per data point. Um and you know that's that's a hyperparameter that um is you know like one of the main issues with the whole method is that it's it's currently like still pretty hyperparameter sensitive. This is just one of the hyperparameters that um you know if you manage to uh get rid of then um you may you know arrive at a more robust method. Um the the way that we you know choose it is basically because we've got toy models we have ground truth and we can you know know whether or not the method is doing the right thing and we can basically search for the right number of um values of K such that you know it it yields the the the ground truth mechanisms. 
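A sketch of the batch top-k selection Lee mentions (illustrative, not the authors' implementation): rather than forcing exactly k active components per data point, keep the batch_size × k highest-scoring (example, component) pairs across the whole batch, so the average per example is still k but individual examples can use more or fewer.

```python
import torch

def batch_topk_mask(attributions, k):
    """attributions: (batch, C) attribution scores; returns a 0/1 mask of the
    batch_size * k highest-scoring entries across the whole batch."""
    batch, C = attributions.shape
    n_keep = batch * k
    flat = attributions.flatten()
    idx = flat.topk(n_keep).indices
    mask = torch.zeros_like(flat).scatter_(0, idx, 1.0)
    return mask.reshape(batch, C)
```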
But um yeah, we want something that's more robust such that if you didn't know the ground truth mechanisms, you could just choose a you know an okay value for the hyperparameter and you would be you know uh you could rest assured that you you know should end up with something approximately right. Right? So one thing that occurs to me is um so in the in the title um it says like uh minimizing uh mechanistic minimizing mechanism description length with attribution based parameter decomposition and um you present it as part of this like you know minimizing minimal description length like part part of this family of things where you know you're trying to like you know run some algorithm to describe stuff and you know you you want to minimize all these ideas of Solman offuction and stuff. And I thought that one of the points of minimal description length type things was that it offered you this ability to like have this principal choice of how to choose hyperparameters, right? like like or at least these sorts of hyperparameters like I think of MDL as saying like oh you know you could you know you can when when you're doing regression right you know you can model it as like a you know degree 1 polinomial or you can model it as a degree 2 or degree 3 and you you know you have this like trade-off between fit and something else and like MDL is supposed to tell you like how many degrees of your polinomial you're supposed to have right in in a similar way I would I would imagine that it should be able to tell you like okay what's how How many components are you supposed to divide into? Um is that like I guess you must have thought of this like does does that actually work? The story is a little bit um more nuanced in that like we you know minimum description length um whenever you're dealing with say some continuous variables um you may have to you know fix one of your continuous variables and say for a given value of this continuous variable how few can I get in these other variables right um and you know suppose you know in the case of an SAE you might say for a given um you know uh mean squed error or you know um like how how low can I get my um uh my L0 sorry how low can I get basically the the description of the set of activations um where you know that depends on like how many things are active for a given data point and um you know how many uh features I've used in my in my sparse autoenccoder dictionary the same thing kind of applies um in APD you need to you know fix some of your um variables So um the mean squed error is one of them. Um you know for a if you really want your uh your mean squed error to be you know um very very low you might need you might get to ablate fewer um parameter components um because you know you'll just predictably increase the loss um if you uplate things uh even if your parameter components are are perfect. Um but there are also um some you know other uh other continuous variables here. um the even though we're using the we're trying to minimize the rank we we don't you know rank is a a non-ifferiable quantity what we are in fact getting to to minimize is um the uh it's basically the the sum of the singular values of a of a of the matrix. This is what uh you know we called in the paper shat norm. Um yeah that's just the the name of the quantity. Um and so we are this is like a continuous approximation of the of the rank. Um if you minimize this uh to yeah basically if you minimize this you minimize minimize the rank. Um but it's not a perfect quantity. 
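A small illustration of why the sum of singular values (a Schatten-type norm) is used as a differentiable stand-in for rank (illustrative values only): adding tiny noise to a rank-one matrix makes its exact rank jump to full, while the singular-value sum stays dominated by the one large value, so minimising the sum pushes the small singular values toward zero.

```python
import torch

torch.manual_seed(0)
u, v = torch.randn(8, 1), torch.randn(1, 16)
nearly_rank_one = u @ v + 1e-3 * torch.randn(8, 16)   # rank-one plus tiny noise

print(torch.linalg.matrix_rank(nearly_rank_one))      # 8: rank ignores "how small"
print(torch.linalg.svdvals(nearly_rank_one).sum())    # dominated by one singular value
```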
So this is our measure of simplicity, and we kind of have to say: for a given level of simplicity, how few active components do we get to have? There are a lot of degrees of freedom that we have to hold constant so that we can then ask how well we can do in minimum-description-length terms. But we basically want to get toward a method where we hold those things constant at a sufficiently low level that we don't have to worry we're introducing arbitrary choices. Right. I mean, in terms of balancing against the loss, I had the impression that for a lot of these mean-squared-error losses you can think of them as a likelihood and end up measuring them in bits. And it makes sense that you'd have to think about singular values rather than literal rank, because in the presence of any noise, every matrix is full rank. I wonder if it's going to come down to the fact that description length is inherently a discrete concept: how many bits are you using to describe a thing? If the thing is continuous, then at what resolution do you want to describe it? That ends up being a hyperparameter of MDL itself, and it seems relevant here: if the thing being described is parameters, you can control that by asking, if I quantize my network to however many bits, how bad is that? Maybe this is one of those things where if I sat down and tried to do it, I'd realize the issue, but it seems doable to me; it seems like there's possibly something here. Yeah, I do agree that it feels like we should be able to find at least a satisfactory quantity to stand in for minimum description length. I'm not sure we'll be able to get away from it requiring some choice like that, and I'm not sure there will be some single optimal version of it, but at the very least I do think we can do better than the current algorithm. Fair enough. So I think the thing I next want to talk about is the experiments you run in the paper. In my recollection, the main paper has basically two types of experiments: first this toy models of superposition setup, and secondly this, I forgot the short name, compressed computation. Compressed computation, yeah. So maybe, and you spoke about it a little earlier, first can you recap how the toy model of superposition experiments work? Yeah.
So some of the folks reading our paper, and many listeners, will be familiar with the model, but again: it's just a matrix that projects sparsely activating data down into some bottleneck space. In that bottleneck space, features have to be represented in superposition, because there are more features than dimensions, and then the matrix has to project back up to a space the size of the original number of data features. So it's an autoencoder-like setup, and because it compresses these dataset features down, it's in some sense unintuitive that it can do this at all, since it has fewer dimensions than features. But it can, and because it has fewer dimensions, there will be some interference between features that are not orthogonal to each other in the bottleneck space. The way it gets over this is that, because it has ReLU activations following the up-projection, it can filter out some of this interference noise and do a reasonably good job of reconstructing the input data features. Now, one way you might think about this network is: we have this matrix, and if only one of the input data features is active, then only one row of the matrix is actually necessary. We can basically set the other rows to zero in cases where only, let's call it, input data feature one is active; the row we have to keep is the first row. So there's a sense in which the rows of our toy model are the ground truth mechanisms. Why are they the ground truth mechanisms? Well, they satisfy the properties we're aiming to recover. They all sum to the original network: all the rows together make up the original matrix. Looking at minimality: because the dataset features are sparsely activating, if you only activate the mechanism corresponding to the active dataset feature and not the others, that's the smallest number of mechanisms you have to activate on that data point, so it's minimal. And they're simple, in the sense that a single row of this matrix, with all the other rows zeroed out, is rank one: it's just the outer product of an indicator vector and the row itself. So they satisfy what we want to call a ground truth mechanism, and these are the things that the randomly initialized parameter components are optimized to approximate.
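A hedged sketch of what has just been described: the rows of the toy-model matrix as rank-1 "ground truth" mechanisms that sum to the full weight matrix. The shapes, the zero bias, and the random (untrained) weights are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

# Sketch of the toy model of superposition and its "ground truth" mechanisms.
# W down-projects n_features sparse inputs into d_hidden dims; the model
# reconstructs with ReLU(x W W^T + b). Zeroing all rows of W except row i
# gives a rank-1 parameter component that handles feature i alone.
rng = np.random.default_rng(0)
n_features, d_hidden = 5, 2

W = rng.normal(size=(n_features, d_hidden))   # stand-in for trained weights
b = np.zeros(n_features)

def tms_forward(x, W_mat, bias):
    hidden = x @ W_mat                              # project into the bottleneck
    return np.maximum(hidden @ W_mat.T + bias, 0.0) # project back up and ReLU

def row_mechanism(W_mat, i):
    """Keep only row i of W: an outer product of an indicator vector and the row."""
    e_i = np.zeros((n_features, 1)); e_i[i] = 1.0
    return e_i @ W_mat[i:i + 1, :]                  # rank-1 matrix

mechanisms = [row_mechanism(W, i) for i in range(n_features)]
print("sum of mechanisms equals W:", np.allclose(sum(mechanisms), W))
print("rank of each mechanism:", [np.linalg.matrix_rank(m) for m in mechanisms])

# On an input where only feature 0 is active, the single corresponding
# mechanism matches the full model up to the interference terms that the
# other (inactive) rows would have let through the ReLU.
x = np.zeros(n_features); x[0] = 1.3
gap = np.abs(tms_forward(x, mechanisms[0], b) - tms_forward(x, W, b)).max()
print("max difference from full model on this input:", round(gap, 4))
```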
And what we then find, when we do this, is that, at least for a given set of hyperparameters, we are able to recover this set of ground truth features. And how, so in the paper, one thing you mention is that the original toy models of superposition work has a bunch of geometry and draws some pictures, partly relying on the fact that there are something like five inputs and two hidden units, a setting that's just very small, so things depend a lot on hyperparameters. You also look at a somewhat higher-dimensional case with, what, 50 inputs and 10 hidden units or something? It's 40 and 10. Yeah, 40 and 10. So my understanding is that you're pretty hyperparameter-sensitive in the really small setting. In the 40-and-10 setting, how hard is it to get the hyperparameters right? It's easier, but I still think it's pretty hard. The five-and-two case is particularly challenging because optimizing in a two-dimensional space is something gradient descent is just not especially good at. It can do it; it's just that moving vectors around each other is more difficult in two-dimensional space than in n-dimensional space, where they basically get to move in any direction and not interfere with each other. In two dimensions there's just a much greater chance of interference. Okay. And I'm especially drawn to this hyperparameter of how many components you have. For some reason it feels like the juiciest hyperparameter, even though obviously the relative weighting of the objective terms and all sorts of other things are also important. In this case you have a ground truth number of components: if you get the number of components slightly wrong, what happens? How badly does it go? I can't recall an exact story for what happens, but in some cases it will learn a bunch of reasonable features and then some features will just not be learned; in other cases it will be much noisier and it'll fail to learn altogether. I can't give a good sense of how sensitive it is to this hyperparameter; my colleague Dan would have a much more informed sense of how sensitive it is to twiddling this. It's also hard to tell whether this hyperparameter is the most sensitive thing versus others. There's basically a bunch of different hyperparameters to get right here, and it's hard to get really good intuitions about any given one of them. So, okay. I eventually want to get to a question about these experiments in general, and to get me there, can you tell me about the compressed computation setup and what's going on there? Yeah. So compressed computation is the name for a phenomenon that we observed in our experiments. We were initially trying to model two different things.
One is a theoretically well-grounded phenomenon that my colleagues Lucius and Jake had talked about in a previous post of theirs, computation in superposition, where a network is basically learning to compute more functions than it has neurons. There's also a related, more empirical phenomenon from the original toy models of superposition paper that they also called computation in superposition, and then there's this third phenomenon that we've called compressed computation. Now, it may be the case that all of these are the same thing, but we are not yet confident enough to say that they are exactly the same phenomenon. At the time we weren't very confident; since then we've become a little more confident and have slightly updated against compressed computation being the same as these other phenomena, and which of the two forms of computation in superposition it might match, I would not be able to answer. But it is nevertheless the case that all of these can be described as learning to compute more functions than you have neurons; it's just that there's a fair bit of wiggle room when you put those words into maths. Sure. And with toy models of superposition, the basic intuition for why it was possible to reconstruct more stuff than you had hidden-activation dimensions was that the stuff you had to reconstruct was sparse, so you didn't have to do that many things at a time. Is it the same thing in compressed computation? Is the trick just that it doesn't have to compute all the things at the same time, or does it somehow really compute all the things at the same time with less space than you'd think you'd need? This is the point on which we're uncertain. We're basically not super confident about how much this phenomenon depends on sparsity. We're also just not super confident about how much the Anthropic toy model of computation in superposition depends on sparsity. We know in their example it does, but we don't have access to the experiments, so we don't know what was going on in the background of those figures. We just haven't got round to doing extensive experiments to figure that out; it wouldn't be too difficult. But in our case, we're quite uncertain how much our phenomenon depends on sparsity. My colleague Stefan has done some experiments in this direction; it's somewhat inconclusive for now. He's got an ongoing project on this, and hopefully there'll be a write-up soon. Long story short, it may or may not depend on sparsity, but for the purposes of this conversation it's probably reasonable to proceed as though it does. Okay. So basically the thing in compressed computation is computing more functions than you have width of internal neurons, and it's sort of surprising that you'd be able to do it, but you can.
And my understanding is that the particular functions you're trying to compute are ReLU functions of the inputs. Yes. And you might think, "ReLU networks, shouldn't they be able to do that?", but the trick is that the network narrows significantly. Yeah. So what's the hope here? What should APD be able to do in this setting if it's working? In this setting, the ground truth mechanisms are supposed to be things where, even though the data has say 100 input dimensions and the labels are 100 ReLUs of that input data, the models have learned to compute 100 ReLUs using only 50 ReLU neurons in the model. Yep. And the idea here is that if they're able to do this, they are in some sense distributing their computation over multiple ReLUs, such that they can do this without interfering with other features when those features aren't active. So you're basically computing more functions than you have neurons because you're not always having to compute them all at the same time, right? And that's just because if you have a negative input, all you have to know is that it ReLUs to zero, and you don't have to do that much computation to approximate the identity otherwise. Yes. Yep. And also, in other cases, suppose you have two input features that are positive, so you need to compute two ReLUs. Well, if you have projected one of your input features onto one set of hidden neurons, such that you can spread its ReLU over multiple hidden ReLUs, and if those are a different set of hidden ReLU neurons than the other feature's, then you should be able to make a good approximation of the ReLU of the input data. The magnitude matters here: if there were some overlap in one of their neurons between these two input features, they would double up, and they would basically overshoot in the output. So if you spread things out a little such that they don't overlap very much, you should be able to compute things with some interference, but ultimately compute more functions than you have neurons; the cost is interference. Gotcha. So as long as you're distributing over the set of neurons... sorry, a thing I just realized: the fact that you're going from 100 inputs to 50 wide, which is half of 100, is that just because numbers have a 50% chance of being positive or negative, so on average you only need to represent half of them? I don't think the number 50 was especially important. I think we could easily have chosen something else; it was somewhat arbitrary. Okay, fair enough. All right. So I was asking what APD is meant to get here, and what was the answer to that again? Yeah, thanks for reminding me. I was trying to get a sense of what the ground truth features should be.
And the ground truth, sorry, I said ground truth features; I mean ground truth mechanisms. These ground truth mechanisms should be things that distribute across multiple hidden neurons. So, the architecture: you've got this down-projection matrix and then this up-projection matrix, or rather, think of it as an embedding matrix, then an MLP-in matrix, an MLP-out matrix, and then an unembedding. It's a residual architecture. So you have this embedding matrix and this MLP-in matrix, and when you multiply those two matrices together, you basically want to show that a given input dimension projects onto multiple hidden neurons. This is what one component should do, and those hidden neurons should then project back to the output feature that corresponds to the input feature you care most about. And you can basically do this for multiple input and output features. Because your input and output features are sparsely activating, you want your parameter components to mostly correspond to only one of these input-output computations; you basically want parameter components that line up strongly with these input-output components. Right. So it seems like the thing is, maybe you don't know exactly which parameters should light up, but you do know that each component APD finds should reconstruct the ReLU of exactly one input and none of the rest. Is that basically right? Basically, yeah, because in this case we basically get to define what the minimal set of components is, since we get to choose a lot about the data distribution. Okay.
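A hedged sketch of the compressed-computation setup just described: a small residual MLP asked to compute a ReLU of each of many sparse input features while having fewer MLP neurons than features. The dimensions, the tied unembedding, the label function, and the untrained random weights below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Sketch of the compressed-computation toy architecture: embed, residual MLP
# block, unembed. The network is asked to compute one ReLU per input feature
# even though d_mlp < n_features.
rng = np.random.default_rng(0)
n_features, d_resid, d_mlp = 100, 200, 50

W_E   = rng.normal(size=(n_features, d_resid)) / np.sqrt(d_resid)  # embedding
W_in  = rng.normal(size=(d_resid, d_mlp))      / np.sqrt(d_mlp)    # MLP in
W_out = rng.normal(size=(d_mlp, d_resid))      / np.sqrt(d_resid)  # MLP out
W_U   = W_E.T                                   # unembedding (tied, for simplicity)

def residual_mlp(x):
    resid = x @ W_E
    resid = resid + np.maximum(resid @ W_in, 0.0) @ W_out   # residual MLP block
    return resid @ W_U

# Sparse input: only a few of the 100 features are active at once.
x = np.zeros(n_features)
x[[3, 41]] = rng.normal(size=2)
labels = np.maximum(x, 0.0)      # one ReLU per input feature (assumed target)

print(residual_mlp(x).shape, labels.shape)   # untrained weights, shapes only

# The hoped-for mechanism for feature i lives in how row i of (W_E @ W_in)
# spreads that feature over several hidden ReLU neurons and back out again.
print("hidden neurons touched by feature 3:",
      np.count_nonzero(np.abs((W_E @ W_in)[3]) > 0.05))
```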
So I think the thing I'm wondering about with both of these tests is this. I think of the idea of APD as: previously a bunch of people have been trying to explain the representation of features; they've looked at these neurons and asked what the neurons represent; but you want to find the mechanisms, right? Now, the thing that strikes me about both of these examples is that they feel very representation-y. It's like: okay, we've got this sparsely activating input, we've got this low-dimensional bottleneck space, and we want to reconstruct these parameter vectors to tell us how the bottleneck space is able to reconstruct each component of the input; or, for each of these inputs, there should be something representing the ReLU of that input, and I just want to divide into things that get the ReLU. Yeah. And it seems to me that networks could have a bunch of mechanisms that don't amount to representing individual features, or that representing things could involve a bunch of mechanisms: for any given thing you represent, maybe there are five mechanisms that are necessary. So basically I had this thought reading the paper that the experiments feel too representation-y and not mechanism-y enough. What do you think of this anxiety I'm feeling? Yeah, I think that's reasonable; I definitely share it. There are a few toy models that we'd be keen to see people work on. But just before I get into that: I do think there's in some sense not a perfect duality between representations and mechanisms, the computation-y point of view, but there is nevertheless a relationship, and it's therefore more a matter of which perspective is most convenient to think with at a given point in time. I do think that when designing these toy models, we wanted a method that works in very simple setups where the representations do in fact correspond to the mechanisms, right? This is just a case where it's been easy to design networks where there's a ground truth that's easily accessible to us. We find it a bit harder to design, or rather train, networks where you could be somewhat sure of what the ground truth was even though there are multiple computational steps involved. I think it's perfectly possible; we did have some cases where we handcrafted some models, and there's an example of this in the appendix, but that had some pathologies: the gradients didn't work especially well on it because it was handcrafted, and so we did find it somewhat challenging. Now, there are some models you could think of that may capture this notion a little more than the ones in the paper. One that's very similar to what is in the paper would be a toy model of superposition where, instead of just a down-projection and then an up-projection, you have a down-projection, then for example an identity matrix in the middle, and then an up-projection; or you can replace the identity with some other rotation. Okay. Now, what would you want APD to find here? Well, you don't really get to think about it in terms of representations anymore, because, fine, you've broken it up into these mechanisms in the down-projection and in the up-projection, but there's this bottleneck where you're doing something, an identity or a rotation. Suppose it's a rotation; that's probably the easier case to think about. You're basically having to use all the ranks of that rotation in the middle for every given data point. You don't actually get to chunk it up.
And so what you would want APD to find is parameter components corresponding to the things we originally found for the simpler example, just the rows of the down-projection matrix and the up-projection matrix, but then also a component that corresponds to this rotation in the middle. Why? Because you're having to use all the ranks of this rotation for every data point. You always have to do it. You don't get to throw it away and reduce your minimality, and you don't get to make it any simpler to reduce your simplicity cost; it's just always there. So this is maybe a case where you do get to think about it in terms of computational steps rather than representations. Yeah.
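As an illustrative aside: the variant just discussed, a toy model of superposition with a rotation in the bottleneck, can be sketched as follows. The rotation angle, shapes, and random weights are made up for illustration; the point is just that the rotation is used at full rank on every input, which is why one would expect it to show up as its own always-active component.

```python
import numpy as np

# Sketch of the TMS-with-a-rotation variant: down-project, rotate in the
# bottleneck, then up-project and ReLU. The hoped-for decomposition is one
# rank-1 component per input feature (in the down/up projections) plus one
# always-active component for the rotation itself.
rng = np.random.default_rng(0)
n_features, d_hidden = 5, 2

W = rng.normal(size=(n_features, d_hidden))

theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 2-D rotation in the bottleneck

def tms_rotated(x):
    hidden = (x @ W) @ R                 # down-project, then rotate
    return np.maximum(hidden @ W.T, 0.0) # up-project and ReLU

x = np.zeros(n_features); x[2] = 0.7
print(tms_rotated(x))

# R cannot be split into sparsely used pieces: it is full rank and every data
# point passes through all of it, so its rank (here 2) is an irreducible cost.
print("rank of R:", np.linalg.matrix_rank(R))
```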
Actually, before I go further, just to pick up on a thing I said: I believe it's in appendix B1 that you hand-designed networks to compute particular functions. Yes. How did you hand-design the networks? I believe this was Jake and Lucius and Stefan. I may be misattributing there, and one or the other may not have been involved, but I think they just thought about it really hard and then came up with it. They're not super complicated networks. They have particular steps; they have a little gate, and for certain inputs your gate is active and on other inputs it's not, and this lets you do subsequent computations. It's been a little while since I've looked at it, but the basic principle is that it's not a complicated network. Yeah. I guess my recollection is that it's basically sinusoidal functions, and if I had to, I could write down a network that approximates them with piecewise-linear pieces, for a wide enough network. Yeah, you can figure out how to do it; it's just kind of tricky. Yeah. This network gave us a lot of grief, because even though it's intuitively quite a simple network, we're using gradient-based attributions, and it just didn't play nice with the method, even though to our naive selves it intuitively felt like it should. We eventually got it working, but it is demoted to the appendix as punishment. So you mentioned this toy network where you project down, do an operation in the down-projected space, and then project back up, and that this is ideally what APD should find. When you say it like that, it sounds like an easy enough experiment to run. Have you tried it? I believe we gave it a go at various points; I think it just wasn't top priority before getting the paper out. Fair enough. It's very possible that we have already got this working and I'm just forgetting. It's also very possible that we tried it and couldn't get it working, or at least didn't want to invest the time to get it working, given the sensitivity of the hyperparameters. But I would be keen to see a verification that it is at least possible for APD to find this. Intuitively it feels like it ought to be able to, but you'd want to actually see where it breaks. Sure. I guess other things strike me as interesting to look into. There are a few cases in the literature where people really do seem to have identified mechanisms within neural networks. I guess the most famous one is induction heads, right? "Induction head" can be kind of a loose term; people use it for a few things; but in at least some cases people can just point to an attention head, or I guess two attention heads, and tell a very clear story about how it looks for an earlier occurrence of a thing, attends to the token after it, and copies that token into the output. So that's one example of something that feels very much like a mechanism and does not feel so representational. Another example is group multiplication: neural networks trained on group multiplication tables that have to fill in the remaining entries. I was semi-involved in a paper; well, I chatted with one of the people and tried to make sure he was on track. There's this Wilson Wu paper, together with Jacob Drori, an author I'm forgetting, and Jason Gross, and they basically end up with a pretty good story of how these networks learn group multiplication: they do this thing, then they transform it in this way, and then they get this output, and it works because of this weird group theory fact. I think there are a few more of these. For at least those examples, do you think we can get APD working on them? How hard would it be to check whether APD actually works on these? It certainly feels possible, certainly in toy models, for the induction head.
Indeed, it was one of the motivations for APD. I'd been working with various MATS scholars, Chris Mathwin, Keith Monroe, and Felix BS as well, on decomposing attention in neural networks, and it basically just feels intuitively like you should be able to do this: add in an SAE here or a transcoder there, and you can kind of make some progress, but it just didn't feel conceptually very satisfying. That was one of the motivations for APD: we really want a method where you don't have to modify it for a slightly different architecture. Maybe it's a gated linear unit, maybe it's a state space model; ideally you wouldn't have to adapt your interpretability methods to these; you should just be able to apply the one method that works for all neural networks. That would be ideal. So that was one of the motivations: looking at attention and how it may actually distribute computations across heads or in various other ways. Now, it feels possible that we should be able to do this in toy models of, say, induction heads. It would be a somewhat more complicated model than APD has been used on thus far, but it does feel possible, and it's one of the things I'm excited for people to try. In the cases of, say, group multiplication or modular addition, it's very possible that if you did apply APD to these models, all mechanisms would be active all the time, and therefore APD would just return the original network. And if that's the case, this is a bullet I'm willing to bite on the method: sometimes we just don't get to decompose things into more components than the original network. These are, after all, fairly special networks, trained in quite different ways from the tasks we really care about, such as language modeling. It's nevertheless something to bear in mind when applying APD to models: in cases where there may be a multi-dimensional feature, it's not going to immediately tell you how to understand the interactions within that multi-dimensional, multi-layer component. But at the very least, what we wanted was a method that would decompose things in the cases where that's actually possible. Right. So, sorry, the thing you said just inspired me to look up this paper. The paper is "Towards a unified and verified understanding of group-operation networks", and the author I was forgetting was Louis Jaburi, so, sorry, Louis. So there's this question: for group multiplications, are all of the things active at once? I think I'm not going to be able to figure it out quickly enough in time for this. But yeah, basically, I don't know.
It does seem like an interesting question whether you can get APD working in a setting like that. I guess it's tricky, because it's a lot easier to have ground truth representations than ground truth mechanisms, especially if the network is an autoencoder or is computing some known function of each individual input. And I guess this just relates to the fact that representation is much easier for us to have good a priori theories of than computation, somewhat unfortunately. I'm curious; maybe I'm just too APD-brained at this point, but could you flesh that intuition out a bit more? What it means for a hidden activation to represent one of the input features in the TMS case doesn't feel intuitively obvious to me. Fine, there may be a direction in hidden-activation space that corresponds to one of the input dimensions, but that point of view doesn't feel more intuitive than saying "this input feature activated that computation". I'm curious. Oh yeah. I guess all I'm saying is that with toy models of superposition, the reason you can very confidently say "this part is doing this thing" is that in some sense you know that all the neural network has to do is put a bunch of information into this two-dimensional knapsack and be able to get it out again; that's just everything the network is doing. You can kind of say, okay, I understand it should be sensitive to all these things, and I guess there are some things you can say about the computation there, but for cases like compressed computation you can just say: I have these things, and I know this input should correspond to this output; that's just definite ground truth, because it's basically what I'm training on. Yeah. Whereas it's a lot harder to look at a network and say, "I know it should be doing this computation, and I know it should be divided up in this way". I think that's fair. And therefore it's easier to test against: do I reconstruct this thing in toy models of superposition, where I know what I'm supposed to reconstruct, versus do I reconstruct this way of doing things, where a priori you don't exactly know. Yeah, I think that's fair. And I think this maybe goes back a little to what we were talking about at the start: even though there may be multiple equivalent ways to describe the computations going on in the network, we really just have to be opinionated about what constitutes a good explanation. Faithfulness to the network, minimality, and simplicity are just the ones that we think are a reasonable set of properties for an explanation to have. Fair enough.
So okay, I think I'm going to transition into just asking miscellaneous questions, less grouped by theme. The first thing I want to talk about: at one point in the paper you address why you're doing APD on these small networks and not on, say, Llama, however many billion parameters you can get these days, and the answer is that it's kind of expensive to run APD. Concretely, how expensive is it to actually run? So, with the version that we've come up with here, we didn't aim for efficiency. We didn't aim for some of the obvious things that you might try in order to get a method that works more efficiently than ours. The reason is that we wanted something where, on theoretical grounds, we could be somewhat satisfied with it and satisfied with it working, and after that we can move to things that are more efficient. For the current method: for a start, we've got, let's call it C, C components, and each of these requires as much memory as the original model. So that's already a multiple of the expensiveness of the original model, just through one forward pass. We also have the first forward pass, the first backward pass, the second forward pass, and the second backward pass, so during one training update we have these four steps. So it's already a multiple of just the forward and backward pass that would be required to train the original model, and each of these steps has a different goal. Right.
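A back-of-the-envelope sketch of the cost argument just made: C components, each as large as the model, plus roughly two forward and two backward passes per update. All numbers below are purely illustrative and not from the paper.

```python
# Rough, illustrative estimate of APD's overhead relative to the original model.
n_params = 10_000_000       # parameters in the original model (made up)
C = 40                      # number of candidate parameter components (made up)
bytes_per_param = 4         # fp32

component_memory = C * n_params * bytes_per_param
model_memory = n_params * bytes_per_param
print(f"parameter memory blow-up: ~{component_memory / model_memory:.0f}x the original model")

# One ordinary training update is roughly 1 forward + 1 backward pass;
# one APD update as described here is roughly 2 forward + 2 backward passes,
# each of which also touches the component tensors.
ordinary_passes = 2
apd_passes = 4
print(f"per-update pass count: ~{apd_passes / ordinary_passes:.0f}x an ordinary update")
```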
And how, well, maybe the answer to this is just that it's another hyperparameter and you've got to fiddle with it, but there's the number of components that you want to end up being active, this k that you talk about, and then there's the total number of components that you have to have in order to run the method at all. Is there something to say like, "oh, it turns out you need five times as many total components as you want active on any single run", or is it just kind of a mess? Well, some people will be familiar with training sparse autoencoders, where in some cases you start with more features than you expect to need, the reason being that during training some might die, and there are various tricks that people have invented to stop them dying, reinitialization and so on. The same is true in APD: some of these parameter components will, in some sense, die, depending on the type of model. So in general you'll want to train with a little more than the total number of ground truth mechanisms, just so that in the off chance that some do die, you still nevertheless have enough to learn all of the ground truth mechanisms. Okay, but it sounds like you're thinking it has to be more like a factor of two than a factor of a hundred or something. That would be my expectation. I don't think there's going to be a ground-truth answer to that, but I don't see any reason why it would need to be many multiples higher. Okay. So if you're thinking about the expense of this, there's this constant blow-up: you've got to do two forward passes and two backward passes on each gradient step, and you've also got to keep these C copies of the network around at all times. And then there's the question of how many steps it takes to actually train APD, which presumably is just an empirical thing that's not super well understood. One question I have: if I remember correctly, there's some part of your paper that mentions that naively APD might take on the order of n-squared time to run. Do I remember that correctly? I think that's a pessimistic upper bound on the expensiveness, and I think there are plenty of reasons to expect it to be lower than this, but I would need to revisit the sentence to be 100% sure what we're actually talking about. Yeah, fair enough. There is a sentence that talks about the scaling and mentions n squared. Yeah. Okay. So the next thing that came across my mind: when you're training for minimality, on each input you run the model forward, you do attribution to get the k most active components, then you drop all the other components, and then you have some training step to make the k most active components reconstruct the behavior better on that input. I'm kind of surprised: one thing you could imagine doing is also training the ones that you dropped to be less relevant on that input than they actually are. I'm wondering if you tried this and it didn't work, or if it's just less of an obvious idea than it feels like to me. Yeah, so concretely, what you might consider doing in that case is having a third forward pass where you only run it with... I guess it may be hard, I hadn't thought about this enough, but it may be hard to distinguish between the influences of... I don't know. On the face of it, it feels like something that could be useful to implement, if it's possible to implement. It does feel possible. I don't recall us trying it, though. The things that we did try were the top-k version, and we also tried an Lp-sparsity version where you penalize everything for being attributed: you penalize everything for having some causal influence over the output, but you penalize the things that were most causally attributed proportionately less than the things that had only some small influence. And that's kind of doing what you describe, though it's not exactly equivalent. It feels possible to do, though; I'd be curious whether it could be done. Gotcha.
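A hedged sketch of the minimality step just summarized: attribute the output to each component, keep only the top-k, and compare the masked model to the full one. To keep it self-contained, this uses a purely linear toy model, where the attributions are exact contributions rather than the gradient-based approximations APD actually uses; all names and shapes are made up.

```python
import numpy as np

# Minimal sketch of one "minimality" step on a linear toy:
#  1) run the full model, 2) attribute the output to each parameter component,
#  3) keep only the top-k components, 4) check the masked model still
#  reproduces the full model's output (what the training loss pushes for).
rng = np.random.default_rng(0)
d_in, d_out, C, k = 8, 4, 6, 2

components = rng.normal(size=(C, d_in, d_out)) * 0.3
W_total = components.sum(axis=0)          # faithfulness: components sum to W
readout = rng.normal(size=d_out)

x = rng.normal(size=d_in)
full_output = x @ W_total @ readout

# Attribution of each component to the scalar output; for a linear model this
# is exactly the component's contribution, which gradient-times-activation
# would also recover.
attributions = np.array([x @ components[c] @ readout for c in range(C)])

active = np.argsort(-np.abs(attributions))[:k]   # top-k most attributed
masked_W = components[active].sum(axis=0)
masked_output = x @ masked_W @ readout

print("active components:", active)
print("full vs top-k output:", round(full_output, 4), round(masked_output, 4))
# In APD proper, the gap between these two outputs feeds the training loss,
# and the components themselves are what gets optimized.
```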
So at this point I'm interested in what follow-up work you think is important to do on APD, either work you think is important, or things enterprising listeners might want to pick up. Yeah. So I mentioned a few of the things I'd be keen to see already. Non-exhaustively: attention; I'd be curious to see whether it can decompose attention in a sensible way. There are various other things. However, the main thing right now is figuring out whether we can make it less sensitive to hyperparameters and more scalable. Robustness and scalability are the main things we're keen to solve, just because solving them will make it less painful whenever we do investigate these other things, like attention or distributed representations across attention heads, and you can then also do this in larger, more interesting models. So the main things are scalability and hyperparameter sensitivity, or robustness. Suppose we solve those: I would be keen to see attention decomposed, and other types of architecture decomposed. There are also a few other phenomena that you might be curious to apply APD to. For instance, memorization. You might imagine that if APD successfully decomposes a network into memorizing versus generalizing parts, there may be one parameter component that corresponds to one memorized data point, and another parameter component that corresponds to a generalizing computation within the network. It may therefore be a nice place to distinguish between these two computational regimes of memorization and generalization, so I'd be keen to see APD applied to that. I mentioned some of the more theoretical things you might want to look into, such as the privileging of layers, or, more implementationally, figuring out whether we can get rid of having to do a top-k setup where you have to choose k. There's basically a bunch of fairly disparate directions, all of which I'm super keen to see done. Our main priorities now are just creating a method that makes those other things a bit easier. Those are non-exhaustive, though; there's a more exhaustive list in the paper. Makes sense. I guess a couple of things seemed interesting to me that I'm curious if you have comments on. One thing, somewhat inspired by our discussion about doing APD to a car: it seems like APD is a method that's sensitive to the input distribution you train it on. And there's this interpretability illusions paper that says sometimes you might think you have a rich enough input distribution, but you don't actually.
So just how sensitive are you to this input distribution, and how right do you have to get it? I don't know if you've had much preliminary exploration into this, but it does seem pretty relevant. It seems relevant. I think this is in some sense unavoidable, just because I want to decompose neural networks, and what does that mean? It means decomposing what these networks are doing, and what they're doing depends on the input distribution. With a different distribution, natural or unnatural, it will just lead to different things. I do think that when we get to more scalable versions of this method, this will become even more important. Ideally you want a method where, suppose you're decomposing Llama 3 or whatever with a scalable method, and you train it using the training distribution of Llama, and then you also train it with the training distribution of Llama permuted: you ideally want to end up with the same thing. Similarly for a large enough subset, and then for more adversarial subsets it will be the case that, for a sufficient level of adversarialness, it will break. I think this maybe emphasizes the importance of just doing interpretability on as large a distribution as you possibly can, which stands in contrast with some of the interpretability that's happened in the past. I like to call this the "big data" approach, where you're basically finding structure first and asking questions later. It's kind of borrowing from areas of science where there's just a lot going on and you want to leverage computation first to actually narrow down what questions you really ought to be asking. The application here in interpretability would be: you want to let computational methods do the work first, and then you figure out what a given component means, rather than presupposing your own ideas of what the components ought to be and then studying those in more detail. This is the kind of approach that APD intends to leverage, this big-data approach, and I think that's somewhat unavoidable for interpretability that can tell you things you weren't looking for in the first place. Fair enough.
I guess another thing that struck my eye in the paper: there's a section that I think of as basically saying why SAEs are bad and rubbish, and one thing that's mentioned is this feature geometry in SAEs, like the day-of-the-week thing where the days sit in a circle: Monday, Tuesday, Wednesday. And I think there's some line that says that the fact that there is this geometry is not purely explained by the linear representation hypothesis, maybe Jake Mendel has written about this, and that we need to understand mechanisms to get there. How soon until APD tells us what's going on with SAE feature geometry, or feature geometry in general? So Jake's post was, if I'm recalling the title correctly, "Feature geometry is outside the superposition hypothesis". Feature geometry is this idea, and it's older than the mechanistic interpretability community; it was present in the neuroscientific literature a bit before. The idea is: suppose you've got a neural network, you train an SAE on the activations, and you look at the features that you end up with. These features tend to correspond to certain things; that was the whole point of training SAEs, to identify interpretable individual components. But when you start comparing the directions of these features relative to each other, you'll notice that, if I look at the Einstein feature, or rather if I look at the Einstein direction, the Munich direction, the Lederhosen direction, and so on, you'll find that these all point in somewhat similar directions. There's a kind of latent semanticity to them; there's something underneath these features. These features were supposed to correspond to the computational units of neural networks, and what this feature geometry is indicating is that there's an underlying computational structure that organizes these features relative to each other, which in my opinion doesn't bode well if you considered SAE features to be fundamental units of computation, because you shouldn't be able to identify these latent variables that are shared across multiple features. So what is giving the geometry to these features? The hypothesis here is: suppose you've got these features, you have the Einstein feature, and you've also got, I don't know, this Lederhosen feature, and so on. Well, these all get the "German" computation done to them.
They're all pointing in this direction because somewhere down the line in the network, the network needs to do the German computation to them: apply some specific transformation, or some set of transformations, that corresponds to the Germanness of a thing. And you can imagine other cases. For animal features, why do all the animals point in similar directions? Because the network needs to do animal-related computations to them. You could go further: why do all the furry animals point in similar directions? Well, because furry computations need to be done to them. The hope here is that if, instead of studying the features and trying to use them as a lens to understand the network, you study the computations, that will inform why the geometry is the way it is, because you get to look at the computations that get done to the features, which is presumably why the network is structuring them this way. Now, it's very possible that you just kick the can down the road: you may find that if you decompose your computations into very simple computational units, there's some geometric relationship between your computational units too. But it nevertheless feels like you've done better than in the SAE case; I'm not saying it obviously totally solves the problem. Yeah, so how long until APD explains all this? Well, we first need either a toy model of feature geometry, a small enough model that you can apply APD to it, and that toy model would need to be convincing, such that people can say it probably applies to larger models; or, absent a convincing toy model, you would need to be able to scale this such that you can apply it to larger models. I can't say for certain when we'll have a scalable method. It's something we're currently working on, and we're very keen for other folks to work on it as well. I would be speculating irresponsibly to say when we'll have a working method for that, but I would hope for anywhere between three months and three years; that's the kind of uncertainty. Yeah. But I guess it illustrates the importance of robustifying this thing, to make it easier to run on bigger instances. Yeah.
So I guess the last question I want to ask is: what's the endgame of APD? Is the hope that I run it on the underlying model of o3 or whatever, and then I just understand all the things it's thinking about at any given point? How should I think about where this is going? What is it actually going to get me in the case of these big chain-of-thought, funky networks? Yeah, it's an important question to ask. The way I see this kind of work, and the way I see the similar work that came before it, such as SAEs or transcoders, is that the point is to break up networks into components as simple as you can, such that whenever you try to understand larger facts about the network, you've got some solid ground to stand on. You can say: well, I got this set of components; if I really were invested, I could in theory understand everything with this very large number of components. Now, do I really think that mech interp is going to let us understand everything? Probably not, as humans. But I do think it will give us solid ground to stand on whenever we want to understand particular phenomena. If I want to understand, say, the deception mechanisms within, e.g., o3 or any other model, well, where do I go looking for them? Currently we look at behaviors. One thing you might be able to do is transcoder-like approaches, but because transcoders and other activation-based methods are primarily looking at activations, they're not necessarily giving you the things that are doing the generalization, so I think you can be less confident that you're understanding how the network would behave in a more general sense. By looking at the objects that are doing the generalization, by looking at the parts of the parameters that are actually doing the thing, you might be able to make more robust claims. So I think it's fair enough to say: look at very specific things. I guess there's also some world in which, once you're able to have these good building blocks, you can do automated interpretability of everything. If you need to, for sure. Yeah, I guess I'm leaving that implicit. The ultimate aim would be that you can automate the process by which you understand parts of the network, such that you can understand broader swathes of it, and ideally you've given yourself solid enough ground to stand on that, whenever you do this, fewer things will slip through the cracks. Sure. I guess one thing that strikes me as interesting about these reasoning models in particular, and sorry, this might be a bit far afield, is that a lot of interpretability work has been focused on understanding single forward passes. In vision especially, for classification models, a lot of the work was on vision classification where, of course, you just want to find the curve detectors or whatever. Yeah.
And for SAEs, you're asking: which things are being represented? One thing that I think reasoning models bring to light is that, in some sense, the relevant mechanisms should be thought of as distributed across forward passes, right? You do a forward pass, you write a thing in your chain of thought, then you do another forward pass and write another thing in your chain of thought. In some sense the real mechanism is a bunch of these end-to-end copies of the network. And I wonder – this might be too speculative to ask about – where do we go in that setting? Do you think it still makes sense to focus so much on understanding these individual forward passes, versus the whole web of computation? I think it probably does. The reason being: what alternatives might we aim for if we instead wanted to handle these more distributed settings, where computations are spread across a whole chain of thought? What might we do in that case? We really care about the faithfulness of the chain of thought. And in the case where we care about faithfulness, we want some way to measure how faithful the chain of thought is actually being, and mech interp does give you some measure of that: if you can understand a given forward pass, and maybe even a small chain, it should give you firmer ground to stand on whenever you make claims like 'this method I developed improves the faithfulness of the chain of thought'. I don't know how you can make such statements without actually having some way to measure the faithfulness of the chain of thought, and that's maybe one way mech interp may be able to help in that regime. That's the one thing that comes to mind. So, wrapping up, I want to check: is there anything that I haven't yet asked you that you think I should have? I think one of the things I find most satisfying about thinking about interpretability in parameter space is that many of the notions we had going into interpretability become a little less confusing. The main example I have in mind here is this idea of a 'feature'. People have used this notion of a feature in a very intuitive sense and struggled for a long time to actually nail it down. What is a feature? What are we really talking about here? It has kind of evaded formalism in some sense. And one of the things I find most satisfying about interpretability in parameter space is that it gives you some foundation on which to base this notion. In particular, the thing we might call a feature of a network is something that uses one parameter component. For instance, what does it mean to say that a model has a feature of a cat inside it?
Well, you can perhaps equivalently say that this model does cat computations – it's got a cat-classifier computation, or a cat-recognition computation. There's a kind of duality there – it's not an exact duality by any means – but it helps provide a sense in which features mean something specific. In particular, whenever you break up a network into faithful, minimal and simple components, these components, these mechanisms, are what you might reasonably call – well, in some cases you could call them a feature; in other cases it's more natural to think about them as a step in the algorithm, a computation that the network is doing. And I think in that sense it's a bit more general than thinking about things in terms of features. Fair enough. Well, to finally wrap up: if people listen to this and they're interested in following your research, how should they do that? Yeah, I post most of my things on Twitter, and I also post on the Alignment Forum. You can just follow me on Twitter and check out my posts on the Alignment Forum. Links to those will be in the description, but for those who don't want to open the description, are you just Lee Sharkey on Twitter and the Alignment Forum? I think my Twitter handle is 'leedsharkey', but I should be findable just as Lee Sharkey – and yeah, Lee Sharkey on the Alignment Forum. All right. Well, thanks very much for coming here. We've been recording for a while and you've been quite generous with your time, so thank you very much. No, thank you, Daniel. It's been great – I've had an awesome time. Cheers. This episode was edited by Kate Brunotts, and Amber helped with transcription. The opening and closing themes are by Jack Garrett. The episode was recorded at FAR Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast, or give a one-off donation at ko-fi.com/axrpodcast. Finally, you can leave your thoughts on this episode at axrp.fyi. [Music]
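For readers who want a concrete handle on the "faithful, minimal and simple" decomposition discussed in the conversation above, the sketch below sets up a toy version of that three-part objective. It is an illustration only, not the paper's implementation: the one-layer linear model, the component count, the attribution proxy used for the minimality term, the nuclear-norm penalty used for the simplicity term, and the loss weights are all assumptions made for the example.

# Minimal sketch of an APD-style objective on a toy one-layer linear model.
# The attribution proxy, nuclear-norm simplicity penalty, component count and
# loss weights are illustrative assumptions, not the paper's actual choices.
import torch

torch.manual_seed(0)

d_in, d_out, n_components = 8, 4, 6
W_target = torch.randn(d_out, d_in)        # parameters of the trained toy network
# Learnable parameter components; they should sum to W_target,
# with only a few of them mattering for any given input.
components = torch.nn.Parameter(0.1 * torch.randn(n_components, d_out, d_in))

def apd_losses(x):
    # Faithfulness: the components should sum back to the original parameters.
    faithfulness = ((components.sum(dim=0) - W_target) ** 2).mean()

    # Minimality (proxy): attribute the output on this batch to each component,
    # then penalise total attribution so that few components are "active".
    per_component_out = torch.einsum("koi,bi->bko", components, x)
    attributions = per_component_out.pow(2).sum(dim=-1)   # shape (batch, component)
    minimality = attributions.mean()

    # Simplicity (proxy): encourage each component to be low-rank
    # via a nuclear-norm penalty on its weight matrix.
    simplicity = torch.stack(
        [torch.linalg.matrix_norm(c, ord="nuc") for c in components]
    ).mean()

    return faithfulness, minimality, simplicity

optimizer = torch.optim.Adam([components], lr=1e-2)
for step in range(500):
    x = torch.randn(32, d_in)              # toy input batch
    f, m, s = apd_losses(x)
    loss = f + 1e-3 * m + 1e-3 * s         # weights chosen arbitrarily for the sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The faithfulness term alone would accept any set of components that sums to the original weights; it is the minimality and simplicity terms that push toward a small number of simple components being responsible for each input. The actual APD losses are more involved than these proxies, so treat this only as a way to make the shape of the objective concrete.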

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med -6 · avg -7 · 120 segs

Counterbalance on this topic

Ranked with the mirror rule described in the methodology: picks sit closer to the opposite side of the spectrum from your score, on the same axis (lens alignment preferred). Each card plots you and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs