AXRP · Civilisational risk and strategy

Erik Jenner on Learned Look-Ahead

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core safety through Erik Jenner's work on learned look-ahead in a chess-playing neural network: how the paper operationalizes look-ahead, what the probing and intervention experiments show, and how relevant learned internal planning is to AI x-risk.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 24 full-transcript segments: median 0 · mean -1 · spread -100 (p10–p90 00) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.

Slice bands
24 slices · p10–p90 00

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 24 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video XGVacNeCT48 · stored Apr 2, 2026 · 700 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/erik-jenner-on-learned-look-ahead.json when you have a listen-based summary.

Daniel Filan: Hello, everyone. This is one of a series of short interviews that I've been conducting at the Bay Area Alignment Workshop, which is run by FAR AI. Links to what we're discussing are, as usual, in the description, a transcript is as usual available at axrp.net, and as usual, if you want to support the podcast, you can do so at patreon.com/axrpodcast. Well, let's continue to the interview.

All right, well, today I'm speaking with Erik Jenner. Erik, can you say a few words about yourself?

Erik Jenner: Yeah, I'm currently a third-year PhD student at UC Berkeley, at the Center for Human-Compatible AI, working on various things around model internals there.

Daniel Filan: Cool. And right now we're at this alignment workshop being run by FAR AI. How are you finding it?

Erik Jenner: It's been fun so far. I mean, we've only had a few talks, but I thought all of them were interesting.

Daniel Filan: Cool. Well, speaking of work that you've done, I guess we're going to talk about this chess paper that you worked on. So that listeners can look it up, what's the name of this paper?

Erik Jenner: It's called "Evidence of Learned Look-Ahead in a Chess-Playing Neural Network".

Daniel Filan: That sort of tells you what it is, but can you elaborate a little bit?

Erik Jenner: Yeah. So I guess the question we're asking ourselves is: neural networks are pretty good at playing chess now, and playing chess in the sense not just of having Monte Carlo tree search with a big explicit search tree, but also playing chess if you only give them one forward pass to make every single move. So the question is, how are they so good at playing chess? In particular, any humans, or manual programs we write, that are similarly good at chess internally have to do a lot of search and think about future moves, rather than just relying on intuitions or heuristics. So the question is: are neural networks just really good at heuristically deciding what move to make, or are they doing something kind of similar, where they're looking ahead in some way when deciding what move to make next?

Daniel Filan: Sure. And when you say "looking ahead in some way": we have this vague notion of planning ahead, or search, but the devil's in the details of how you actually operationalize it. How do you operationalize it in the paper?

Erik Jenner: Yeah, so ideally we would have wanted to find some clear search tree in there, and things like that, but realistically we had to settle for something much broader, which is just: the model is representing which moves it's likely going to make a little bit into the future, and uses that to decide which move to make right now.

Daniel Filan: When it's representing which moves it is likely to make in the future, do you know if that's... so one version of that is "I guess the sort of thing I'm likely to do in the future is this, therefore what would be a reasonable thing to do right now to prepare for this future in which I'm going to do this random thing", versus thinking carefully about "okay, in the future it would be good to do this, therefore now it would be good to do this".

Erik Jenner: Yeah, so what we look at is the specific moves that the model is going to make in the future, so it's not just some generic type of thing that it might do, it's specific moves. What we don't know is exactly what the algorithm here is.

For example, you could imagine that the model is thinking "it would be nice if I could play this checkmate move in the future, so now I have to do this to prepare for that", or it could be that the model is considering different moves it could make right now, and then for each one it's thinking about what would be good follow-ups and using that to evaluate them. We aren't really distinguishing between these two different options; we're just saying there's some sense of thinking about future moves, and that's important.

Daniel Filan: Okay. So it's representing future moves, and you also said that there's some way of taking information about future moves into the present. Was that right?

Erik Jenner: Yeah. So specifically, some of the experiments we do are just probing, where we just provide correlational evidence that there is some representation that we can use to extract these future moves. But then we also have some experiments where we do certain interventions that we argue, or think, correspond to intervening on moves, and show that we can, for example, destroy model performance in very specific ways.

Daniel Filan: So are the interventions something like: if you make the model think that it won't be able to do this checkmate move later, then it doesn't prepare for it right now?

Erik Jenner: Yeah, basically. In some of the experiments, for example, we're blocking information flow in very specific ways that we think correspond to the ability to think about this future checkmate, and then we show that this has a much bigger effect on the model's performance than if we do random other ablations on other parts of the model.

Daniel Filan: I wonder... so my understanding is that your dataset is a particular subset of chess puzzles, right?

Erik Jenner: Yeah. So we start with this public dataset of puzzles that are designed to be solved by human players, and then we do a lot of different filtering to make it suitable for our purposes. For example, we want puzzles where the response by the opponent is pretty obvious, or there's one clear response, such that there's also only one follow-up move that we have to look for in the model. You could imagine cases where you play a move, the opponent has two different good responses, and then how you respond depends on which one they picked; that would make all of our experiments much harder or more annoying, because we don't know which move to look for anymore. Whereas in our case, we filter for cases where there's actually one ground-truth future move that we can look for.

Daniel Filan: Yeah, so maybe you're filtering this out, but it seems like one interesting thing to look for would be... imagine some chess puzzle where there are two ways that I can achieve checkmate. I can move my knight, then something happens, then I checkmate with my knight, or I move my rook, something happens, then I checkmate with my rook. One thing that would be kind of interesting is if a model chose the knight path, but if you intervened on it and said "in the future you're not going to move your knight to get checkmate", it then went the rook route to the checkmate. That would be kind of interesting: it would show some sort of forward planning that's responsive to predictions of the future move in an interesting, structured way. Do you have any evidence like that?

Erik Jenner: So yeah, there's not much on this in the paper, but we did try a few things kind of like this.

For example, one thing we tested: basically, we have two possible moves, like you say, and I think it's a little bit simpler in that in the end everything ends up being very similar, but you have two initial moves that kick off this checkmate sequence. What we tested is looking at the evaluation of the network. If we intervene on one of those moves, then the evaluation stays high; the network still thinks it's winning. But if we intervene on both, then we get this superadditive effect where now it realizes, or thinks, that it's no longer winning, there's no checkmate anymore. So it seems like there's some kind of logical or maximum structure in that evaluation, where it realizes that either one of those would be sufficient to win.

Daniel Filan: Okay. And is that in the appendices of the paper?

Erik Jenner: No, that's not even in the appendices; that's a thing we tried once. The main problem with this is that we weren't entirely sure how to rigorously test it across many puzzles: this was in a few puzzles that we set up. It's probably possible, but it's non-trivial to automatically find a lot of puzzles with this property, so we just didn't get around to turning that into a real experiment. It's more a thing we tried.

Daniel Filan: I wonder if there's some tag... so you get puzzles from this online set of chess puzzles. I wonder if they have some sort of tag for two-way paths like that.

Erik Jenner: They definitely have lots of tags, but I think it's more for geometric motifs, or I guess for motifs the way humans think about them. Maybe there happens to be one like that, but I don't think so.

Daniel Filan: Fair enough. So I'm interested a little bit in the details. In order to see that the neural net has this representation of the future move, you train this probe. A generic worry about training probes is that it might just be correlational, or it might even be that you're training a probe on very high-dimensional data and maybe you can probe for anything. How solid do you think this probing result is?

Erik Jenner: Yeah, I think if we only had the probing result, I would say it's very suggestive that there's something interesting, but it's very unclear how to interpret it. The way we try to get around this is that we also have a very simple baseline, which is just training the same probe on a randomly initialized model, so at least we're making sure that the probe isn't doing all the work. We can also see that probing accuracy goes up through the layers of the network, so later layers are better at predicting future moves, which is kind of encouraging. But if we only had the probe, I would be pretty worried about any actual mechanistic claims about what's going on in the model. Then we also have these other experiments. I think their weakness is that they're less obviously about look-ahead: the probe is very obviously just probing for these future moves, whereas the other experiments require some interpretation to claim they're about look-ahead. But they're interventional, and so in that sense more robust.

Daniel Filan: And the interventional experiments: am I right that you're intervening on features that you found using the probe?

Erik Jenner: No, they're basically separate.

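A minimal sketch, in PyTorch, of the kind of probing setup described above. This is not the paper's code: the activation shapes, the per-square flattening, and every name here are assumptions made for illustration.

    import torch
    import torch.nn as nn

    # Assumed shapes (illustrative): activations from one layer of the chess network,
    # (n_puzzles, 64 squares, d_model); each label is the index (0..63) of the target
    # square of the move the model will make two plies from now.
    d_model, n_squares = 512, 64

    class FutureMoveProbe(nn.Module):
        """Linear probe reading a future-move target square out of one layer's activations."""
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(n_squares * d_model, n_squares)

        def forward(self, acts):                 # acts: (batch, 64, d_model)
            return self.linear(acts.flatten(1))  # logits over the 64 candidate squares

    def train_probe(acts, labels, epochs=20, lr=1e-3):
        probe = FutureMoveProbe()
        opt = torch.optim.Adam(probe.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.cross_entropy(probe(acts), labels)
            loss.backward()
            opt.step()
        return probe

    # Control mentioned in the interview: train the identical probe on activations from a
    # randomly initialised copy of the network. If that probe does nearly as well, the
    # probe itself is doing the work; if instead accuracy climbs layer by layer in the
    # trained model, that is (correlational) evidence the model represents the future move.
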
Erik Jenner: For those interventional experiments... okay, maybe I should say a little bit about the architecture of this network. It's a Transformer where what would be a token position in a language model is a square of the chess board instead. So it gets a chess board position as an input, it does a forward pass, and every square corresponds to a slice of the activations. So we can talk about things like the representations that are stored on some specific square, and one thing we found pretty consistently is that the network does seem to store information about moves on, for example, squares that are involved in that move, or kind of in places where you'd expect. A lot of the other interventional experiments are about looking at the information stored on squares that are involved in future moves, or the information flow from those squares to other squares, and vice versa, and things like that.

I think the structure for a lot of those arguments is basically saying: we see these really strong effects, where some types of operations have a much bigger effect than other types of operations, and really the only explanation we can come up with is that there's some kind of looking at future moves going on, because otherwise these effects just seem pretty inexplicable to us. But it's a little bit trickier to be sure that there's not some alternative explanation for those results.

Daniel Filan: Would it work to... the version of this experiment that was in my head was: you find a probe, you find some activations in the network that represent future moves, and then you ablate the thing where it realizes that it's going to make this future move, as found by the probe. Did that experiment work, functionally?

Erik Jenner: Yeah, so you can just subtract the probe direction, and it often messes up performance, but you can also do random other stuff to the activations and it has similarly big effects on performance. It would be nice if there was some experiment you could do where you're not just making the model ignore its best move, but making it take some other move by adding in certain vectors that you got from your probe. We don't have any results like that.

Daniel Filan: So for the actual ablation thing you did, is it a thing where you're ablating the information on that square, and ablating information on a random square would not be that bad, but on the square corresponding to the future move it really matters?

Erik Jenner: Yeah. For example, we have this activation patching experiment. Activation patching just means we take activations from some corrupted forward pass and patch them into a clean forward pass, and that tells us how important the information stored on various squares is for the output. We see that there are some types of squares where we can predict, just from the architecture, that they're going to be important, but apart from that, the main effect is on the target square of this future move: the one the model is going to make after the one it's making in the current position, basically. For example, the average effect of patching on that square is bigger than the average effect of patching on the square that has the maximum effect in any given position. So if, in every position, we take the max of all the other effects on other squares, that's still smaller than patching on this one particular square. So I think it's pretty unlikely that it's just because those squares happen to be tactically important; it's probably the fact that it is the target square of the future move that makes it important.

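A rough sketch of the single-square activation patching described above, under assumed interfaces: the model.layers attribute, the per-square activation layout, and the single-position move-logit output are made up for illustration, not the real model's API.

    import torch

    def patch_square_effect(model, clean_board, corrupted_board, layer, square):
        """Cache one square's activations from a corrupted forward pass, splice them
        into a clean forward pass at the same layer, and measure how much probability
        the model loses on its original top move."""
        cache = {}

        def save_hook(module, inputs, output):
            cache["act"] = output[:, square, :].detach()

        def patch_hook(module, inputs, output):
            patched = output.clone()
            patched[:, square, :] = cache["act"]
            return patched

        handle = model.layers[layer].register_forward_hook(save_hook)
        with torch.no_grad():
            model(corrupted_board)              # corrupted run, only to fill the cache
        handle.remove()

        with torch.no_grad():
            clean_logits = model(clean_board).squeeze(0)       # (n_moves,)

        handle = model.layers[layer].register_forward_hook(patch_hook)
        with torch.no_grad():
            patched_logits = model(clean_board).squeeze(0)
        handle.remove()

        top_move = clean_logits.argmax()
        drop = (torch.softmax(clean_logits, -1)[top_move]
                - torch.softmax(patched_logits, -1)[top_move])
        return drop.item()

    # The comparison described above: the average drop from patching the target square
    # of the future move is larger than the per-position maximum drop over all other
    # squares.
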
Daniel Filan: Okay. So I guess another question I have is: if someone's reading this paper, they're going to observe "messing up this square degrades performance more than messing up that square", and for a lot of these experiments there's this question of how big the effect needs to be for your explanation to be vindicated, basically. How do you think readers should think about that?

Erik Jenner: Yeah, I think that's a good question. Overall I would say our effect sizes tend to be pretty big in the cases where we get effects at all, but not very consistently. It's something like: even after the filtering we do to our puzzle dataset, there are still some puzzles where we just don't see any effect at all. We manually looked at some of those, and in many cases I think it's kind of reasonable that we don't get an effect, or for each of those we can kind of explain it away, but maybe that shouldn't convince readers. So I think that's one of the big reasons to be skeptical of them. But then the average effect sizes, or the effect sizes in the positions where there is anything at all, are often pretty big. For example, if the probability that the model assigned to the top move without any intervention was something like 50% or 60%, it often goes down to something like 10%, or sometimes even much less than that, from very small interventions, where if you do an equivalently big intervention in some random place in the model, you just see no effect at all.

Daniel Filan: Right. I guess it's interesting: one metric you could be tracking is just accuracy loss, but there's this other metric you could track, which is accuracy loss per amount of activation you ablated, like "these are the really efficient ablations".

Erik Jenner: Specifically, one of the interventions we have: there's this one attention head in the model where we think one of the main things it's doing, or the main thing, is moving information from the target square of a future move to the target square of the immediate move the model is going to make next. That seems very suggestive of some look-ahead stuff going on. One of the ways we validate that this is the main thing the head is doing is by ablating only this information flow between those two squares, which corresponds to zeroing out a single number in the attention map of that head. That has a very big effect, whereas if we zero out all the other numbers in that attention head, or if we zero out some random other attention head completely, it has a very small effect. So there's this one activation, one floating-point number, we can zero out which has much bigger effects than most other interventions we can do.

Daniel Filan: Gotcha.

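The single-number ablation described above, sketched under the same assumptions. The hook point on a head's attention probabilities is an invented interface; the point is only to show what zeroing one entry of one head's attention map means.

    import torch

    def zero_one_attention_weight(model, board, layer, head, future_sq, immediate_sq):
        """Zero the single attention weight with which the immediate-move target square
        attends to the future-move target square in one head, and measure the drop in
        probability on the model's top move."""
        def hook(module, inputs, output):
            # output assumed to be attention probabilities, shape
            # (batch, n_heads, 64 query squares, 64 key squares)
            attn = output.clone()
            attn[:, head, immediate_sq, future_sq] = 0.0
            return attn

        handle = model.layers[layer].attention_probs.register_forward_hook(hook)
        with torch.no_grad():
            ablated_logits = model(board).squeeze(0)
        handle.remove()

        with torch.no_grad():
            clean_logits = model(board).squeeze(0)

        top_move = clean_logits.argmax()
        return (torch.softmax(clean_logits, -1)[top_move]
                - torch.softmax(ablated_logits, -1)[top_move]).item()

    # Controls from the interview: zero all the other entries of the same head, or a
    # different head entirely, and check that those interventions barely move the
    # probabilities by comparison.
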
Daniel Filan: So, this is the AI X-risk Research Podcast. I'm wondering: do you see this work just as an academic curiosity, or do you think that it's relevant to understanding x-risk from AI?

Erik Jenner: Yeah, it was definitely partially motivated by x-risk initially. I think, in hindsight, the impact on reducing x-risk is probably not that amazing, but some of the connections that I was interested in initially... I guess the main thing is that people in the x-risk space have been talking a lot about models being scary if they can do internal search or planning and things like that. So, for example, it might be nice if we had general methods that could tell us whether a certain model was doing internal planning. It would also be nice if we could then localize that to some extent, and maybe that's an especially important target for interpretability or for interventions on what the model does. So I think there's both a perspective of understanding whether the models are capable of this, and then also, maybe this is an important interpretability target.

Daniel Filan: Yeah. One thing that occurs to me that's nice about your work is just specifying what we even mean by "search", what we mean by internal look-ahead. There are different algorithms that networks could be doing, and you could imagine different things you might call search having different safety concerns, and just mapping out the space of quasi-reasoning-like things that models could have going on seems like it could potentially be important.

Erik Jenner: Yeah, I think I agree with that, and I don't think we do a lot of that in this work. We're basically saying, okay, we take this one particular rough definition of what we mean by look-ahead, which is mainly motivated by "well, that's what we could show the model is actually doing", rather than by carefully thinking through the threat models here. And I think one of the reasons why I'm not sure how important this project is for x-risk is that the specific kinds of things we find are sort of unusually clean, because we're looking at this chess-playing model where the domain is very simple and we can find relatively explicit signs of look-ahead algorithms, at least. I kind of suspect that if you want to think about "oh, is my language model internally planning?", this just looks pretty different, potentially. But this could be a first step, and if someone could do similar things but for language models somehow, that seems much harder, but also probably closer to what we care about.

Daniel Filan: Yeah. So you worked on this paper and you have a bunch of co-authors. I'm wondering, how much follow-up work should I expect to see in the next year or two?

Erik Jenner: So neither my co-authors nor I have any concrete plans for follow-up work, but I've talked to several people... there's been a little bit of follow-up work already by some independent people who just happened to see the paper and did some small experiments on that same model, and I know of people who are interested in doing this in more interesting settings. So I would guess we'll see at least some smaller projects on chess models, and I would be excited to also see something on language models, but I'm much less sure what that should look like.

Daniel Filan: Sure. So if listeners are interested in maybe doing some follow-up work, it sounds like trying to find something similar in language models, and seeing if the mechanism is similar or different, was one thing you mentioned. Are there other types of follow-up that you would be excited to see?

Erik Jenner: Yeah, so I guess the other direction for follow-up would mainly be just trying to understand better what happens in Leela, or in some other similar model. I would say our understanding is still very... we have a lot of evidence, I think, that there's something interesting going on, but we don't really understand the algorithms well. I think it would be an interesting target for just doing typical mechanistic interpretability, where you try to figure out how it actually works, and my sense is that you could probably make pretty significant progress just by looking at this specific model or similar models. I guess if people are interested, they can read our paper, and also, if they want to work on this, reach out to me and ask about specific ideas I have, or just talk to me about things.

Daniel Filan: I wonder if it's almost similar to... there's this Othello-GPT paper where people are trying to figure out whether a Transformer trained to play Othello is actually modelling the game or just...

Erik Jenner: Yeah, so Othello-GPT plays Othello, but not well; it mainly makes legal moves. The main point there is that that model is trained on sequences of moves, and it learns to internally represent the board state. In our case, the model already gets the board state as input, and the main point is that it's using that to play well somehow. I do think it could be interesting to combine those. People have also trained models to play chess analogously to GPT and have shown similar things there, where they represent the board state, so you could see if those models also learn to use that latent representation of the board state to do planning. I think it's probably more challenging than what we did, definitely, but it would be an interesting extension.

Daniel Filan: Right. I mean, with those models you have to really rely on the probe.

Erik Jenner: Yeah, I think from an interpretability perspective, the challenge is that now your probe is already some imperfect representation of the board state, but I think that would be an interesting challenge. The other question is whether those models are actually planning or not. For the model we look at, there are lots of estimates for how good it is exactly; it's probably not quite human grandmaster level, but it's very strong at chess, and that was the main reason, going in, why I was optimistic that it was doing something interesting. Those models trained on sequences of moves have a much harder job: obviously, they don't get to see the current board state, and so they tend to be quite a bit weaker right now. So I'm less confident that they're actually doing something as nice and interesting, and that probably makes it harder to understand how they work, but it would be interesting to look into.

Daniel Filan: Yeah, that actually reminds me. One thing you mentioned in the paper, maybe in a footnote, maybe not, is that you say you don't necessarily claim that this network is doing planning in every case; you just try to find a subset of cases where it is. So you've mentioned that there are certain chess problems where you don't notice this look-ahead thing, and you're looking at chess problems, not things more broadly.

In cases where you are not finding look-ahead behaviour with your methods, do you think the network is not actually doing look-ahead or search, or do you think you're just failing to find it?

Erik Jenner: Yeah, that's a very good question, and hard to know. My guess is probably that look-ahead is always going on to some extent, just because I feel like neural networks probably have a hard time completely switching off certain types of circuits contextually, but maybe they can; I don't know. My best guess would be there's always some of that going on, it's just not always as influential for the output. Sometimes you just have heuristics that already solve the task, and then the look-ahead just isn't really necessary, and if you ablate it, the network still gets the answer right.

Daniel Filan: Cool. Well, thanks very much for chatting with me.

Erik Jenner: Yeah, thanks for having me. It was great.

Daniel Filan: This episode was edited by Kate Brunotts, and Amber Dawn helped with transcription. The opening and closing themes are by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of the episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med -6 · avg -7 · 120 segs

Counterbalance on this topic

Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of the spectrum from this page's score on the same axis (matching lens preferred). Each card plots this page and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0
This page · This pick

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0
This page · This pick

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0
This page · This pick

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs