
AXRP · Civilisational risk and strategy

Jaime Sevilla on Forecasting AI

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core-safety themes through Jaime Sevilla's work on forecasting AI, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Governance · High confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 95 full-transcript segments: median 0 · mean 2 · spread −13 to 17 (p10–p90: 0 to 9) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.

Slice bands
95 slices · p10–p90: 0 to 9

Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: high.

  • Emphasizes alignment
  • Emphasizes policy
  • Full transcript scored in 95 sequential slices (median slice 0).

Editor note

Use this to calibrate planning horizons before making strategy or policy commitments.

ai-safety · timelines · axrp · core-safety · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video bmJJ0WiPhQ8 · stored Apr 2, 2026 · 2,705 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/jaime-sevilla-on-forecasting-ai.json when you have a listen-based summary.

[Music]

Daniel Filan: Hello, everybody. In this episode I'll be speaking with Jaime Sevilla. Jaime is the director of Epoch AI, an organization that researches the future trajectory of AI. In this conversation we use the term "FLOP" a lot. FLOP is short for "floating point operation", which just means a computer multiplying or adding two numbers together; it's a measure of how much computation is being done. Links to what we're discussing are available in the description, and you can read a transcript at axrp.net. Well, Jaime, welcome to AXRP.

Jaime Sevilla: Thank you for having me here.

Daniel Filan: So you study how AI is progressing, at a high level. For people who have been living under a rock: how's AI doing? How's it progressing?

Jaime Sevilla: It's going pretty fast, Daniel. I think that right now the two things you need to take into account when thinking about AI are how fast the inputs into AI are growing and how fast the outputs are improving. In terms of inputs, the progress is exponential: the amount of compute being used to train modern machine learning systems is increasing at a staggering pace, multiplying by a factor of nearly four every year. In terms of outputs, we have also seen some dramatic advances. This is a bit harder to quantify, naturally, but if you look at where image generation was four years ago and compare it with today: today we have photorealistic image generation, whereas before it was these blobs that were barely related to the text you were entering. And in text we have also seen very dramatic advances, where now I use, and I suppose many in the audience will be using, ChatGPT daily to help with their tasks and coding.

Daniel Filan: If people are interested in statistics about what's going on with AI, one thing I really recommend is this dashboard you have on your website. Is it called the dashboard, or Trends, or something?

Jaime Sevilla: We call it the Trends dashboard.

Daniel Filan: OK, so that's one thing people can use to get a handle on what's going on. One question I have in this domain: I have an Anki deck, basically a deck of flashcards. It shows me a flashcard and I say whether I got the answer right, and then it shows it to me again very soon if I got it wrong, or some period of time later if I got it right. In this deck I have some flashcards of how big various books are, in terms of number of words, to give me a sense of word counts. I also have how many floating point operations were used in training GPT-3. If somebody wants a good quantitative sense of what's going on in AI, what should they put in their flashcard deck?

Jaime Sevilla: The two main things to put in your Anki deck: the first is the number I already gave you, which is the increase in training compute per year. Right now, one of the best predictors we have of performance is the scale of the models, and the best way of quantifying scale, we have found, is the amount of computation used to train the models, in terms of the number of floating point operations performed during training. This is increasing right now for notable machine learning systems at a rate of about 4x per year, and if you look at models at the very frontier, you find a similar rate of scaling. So I would recommend you put in that number. But that's not the whole picture, because alongside the compute there are also the improvements we have seen to architectures, to ways of training, to all these different techniques and scientific innovations that allow you to better use the compute you have to train more capable systems. All of that we usually refer to under the name of algorithmic improvements. We had this cool paper where we tried to quantify the rate of algorithmic improvement in language models in particular, and what we found, roughly, is that the amount of compute you need to reach a given level of performance has been decreasing at a rate of about 3x per year. I actually have a fun anecdote about this. Just this week, Karpathy has been working on this project where he's trying to retrain GPT-2 using modern advances in architectures, at a much cheaper scale. He estimated: we don't know exactly how much GPT-2 cost when it came out in 2019, but I estimate it cost around $100,000 to train. With all the techniques he has applied, he trained a GPT-2-equivalent model for about 700 bucks.
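These two numbers compound, and it can be useful to see them side by side. A minimal sketch, assuming only the two Epoch estimates quoted above (roughly 4x/year compute growth, roughly 3x/year algorithmic efficiency); everything else is illustration:

```python
import math

# Two Epoch estimates quoted above; everything else is illustrative.
compute_growth = 4.0     # physical training compute multiplier per year
algorithmic_gain = 3.0   # compute needed for fixed performance shrinks ~3x/year

effective_growth = compute_growth * algorithmic_gain
print(f"Effective compute growth: {effective_growth:.0f}x/year")          # ~12x
print(f"= {math.log10(effective_growth):.2f} orders of magnitude/year")   # ~1.08

# At 3x/year, a fixed-capability model gets ~3^5 = 243x cheaper over five
# years; the GPT-2 anecdote above ($100,000 in 2019 -> ~$700 in 2024,
# ~140x) is in the same ballpark.
print(f"Five-year algorithmic cost reduction: {algorithmic_gain**5:.0f}x")
```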
Daniel Filan: Wow, that's a lot cheaper. OK, there are a few things there; let me get into some of this. In terms of compute, this is going off the post I think was called "Training compute of frontier AI grows by 4-5x per year", by yourself and Edu Roldán. Growing the amount of computation used in training AI models by four to five times per year seems kind of insane. I don't know if you know the answer to this: is there anywhere else in the economy where the inputs we're putting into some problem are growing that fast, that isn't just some tiny, minuscule thing?

Jaime Sevilla: That's an excellent question to ask. I don't have any examples off the top of my head, but it would definitely be interesting to hear if there are any analogues to this.

Daniel Filan: Yeah, if any listeners have some sort of quantitative econ bent, I'd love to know the answer. So: how did you figure that number out?

Jaime Sevilla: Pretty simple. This all traces back to before Epoch. I was a bit frustrated with the state of the art in talking about how AI is going, because people were not being very quantitative about it, whereas we were already living in a world where we had the systems and could do a more systematic study of what's happening. So I started, together with my now-colleague Pablo Villalobos, writing down information like: these are 100 important papers in machine learning; this is the amount of resources that was used to train the models; this is the size of the models; and so on. That project has continued up until now, and at Epoch we have taken on the mission of keeping this database of notable machine learning models throughout history updated, annotating how much compute was used to train them. At many points there's a lot of guesswork involved. People don't usually report these numbers very straightforwardly, so we need to do a bit of detective work: all right, on which kind of cluster was this model trained, which model of GPUs did it use, for how long was it trained, and from there make a sensible estimate of the amount of computation used to train the model.

Daniel Filan: Sure. How well are you able to do this? One thing I'm thinking of: my understanding is that OpenAI basically doesn't tell anyone how it made GPT-4. I'm not even sure they've confirmed it uses the transformer architecture at all. In a case like that, where they're just not releasing seemingly any information, how much can you do to figure out how much compute went into it?

Jaime Sevilla: The answer here is: well, some information gets leaked eventually. There are some unconfirmed rumors, and sometimes you can use them to paint an approximate picture of what is happening. You can also look at the performance of the model and compare it with the performance of models for which you actually know the compute, to get an idea of how large you think the model is. This is obviously not perfect, and it's a situation that's become increasingly common over the last couple of years: the labs at the frontier of capabilities are more reluctant to share details about their models. This is something I'm a bit sad about. I think the world would be a better place if we were more public about the amount of resources being used to train these systems. Obviously this has some implications, and it's useful information for your competitors to some extent, so it's understandable to a point that labs are reluctant to share it. But given the stakes of the development of this technology, I would be happy if we had this collective information on how many resources you need to train a model with a certain level of capabilities.

Daniel Filan: Sure. OK, so one possibility is leaks, and I sort of see how that could be useful for something like GPT-4. But for these frontier models, presumably what makes them frontier models is that, for at least some of them, there are no comparable models, nothing where you could say "GPT-4 is about as good as this thing, and I know how much computation was used in training that thing". Can you do anything there, or is it just leaks?

Jaime Sevilla: For example, let's walk through how we did the compute estimate for the Gemini Ultra model. For Gemini Ultra we didn't have the full details of how it was trained, but we had some reasonable guesses about the resources Google has at its disposal, and from there we made an estimate based on: we think they have this many TPUs, and we think it was trained for this long. That gives us an anchor estimate. The other thing we did: together with the model, they released results from a series of benchmarks, which is quite common. So what we did is look at each of those benchmarks. For these benchmarks, we already had previous estimates of the performance that other models got and the amount of compute those models had, and that allows us to paint a picture, a rough extrapolation of: if you were to increase the scale of these models, what performance would we expect you to achieve? From there we backed out: given that Gemini Ultra achieved this performance on these benchmarks, what scale did we expect it to have? And what we found is that these two ways of producing an estimate, the hardware-based one and the benchmark-based one, looked roughly similar. That gave us confidence in saying: well, we have huge uncertainty, this could be off by a factor of five, but it seems somewhat reasonable to say that Gemini Ultra was trained with around 5 × 10^25 FLOP.
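For readers who want to see the shape of the hardware-based anchor estimate described here, a minimal sketch follows. The chip count, throughput, utilization, and duration below are illustrative placeholders, not Epoch's actual Gemini Ultra inputs:

```python
# Sketch of a hardware-based training-compute estimate, in the style
# described above. All inputs are illustrative placeholders, NOT the
# figures Epoch actually used.

def training_compute(num_chips, peak_flop_per_s, utilization, days):
    """Total training FLOP ~= chips x peak FLOP/s x utilization x seconds."""
    seconds = days * 24 * 3600
    return num_chips * peak_flop_per_s * utilization * seconds

estimate = training_compute(
    num_chips=15_000,       # hypothetical accelerator count
    peak_flop_per_s=1e15,   # ~H100-class peak throughput (FP16)
    utilization=0.4,        # realistic fraction of peak actually achieved
    days=100,               # hypothetical training duration
)
print(f"Anchor estimate: {estimate:.1e} FLOP")  # ~5.2e25, same order as above
```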
Daniel Filan: OK. This is kind of a tangent, but I remember a week or two ago I was looking at your Trends dashboard, I think because I was going to suggest some other people look at it, and I had a look at this number, 5 × 10^25. I was also looking at this thing Daniel Kokotajlo wrote in 2021, a vignette for what he thought the next four or five years would look like. I went to the part of the story about 2024, because I was curious how well he did, and there's a paragraph in that section where he says: the year is 2024, and the best model that has been trained had 5 × 10^25 floating point operations put into it. It was kind of freaky that that was so close. And in general, the graphs you draw of computation used in frontier models, the lines seem pretty straight. It seems like this is somehow a kind of predictable, smooth process. Is there something to say about what's driving that smoothness? Why is it so regular?

Jaime Sevilla: This is a really interesting question, and one that keeps me awake at night: where do these straight lines come from? Maybe let's dive a bit into what goes into compute, what makes compute numbers go up. I will say there are two major things. One of them is that hardware becomes more efficient over time, so we have machines with greater performance, and for a given budget of money you can get more compute out of it. That's actually a pretty small number compared to the growth we're seeing in compute: improvements in hardware efficiency, at a fixed level of precision, have been around 35% per year among the GPUs used for machine learning training over the last ten years or so. But the trend in compute is 4x per year, which is far greater than this improvement in hardware efficiency. So what explains the rest of the difference? Well, a bit of that is because people have been training...

Daniel Filan: Sorry: the growth in compute is 4x per year, and the growth in compute efficiency per dollar is 35% per year. Wouldn't that make the compute trend something like 12x as fast?

Jaime Sevilla: I recommend you think about this in terms of orders of magnitude per year, because I think that helps paint the picture better. 4x per year is roughly 0.6 orders of magnitude per year.

Daniel Filan: An "OOM" being an order of magnitude.

Jaime Sevilla: Yeah. And 35% is roughly 0.12 orders of magnitude per year.

Daniel Filan: OK, so it's about 4x in terms of the number of orders of magnitude per year.

Jaime Sevilla: More like 5x, actually.

Daniel Filan: Gotcha. Sorry, I cut you off there, but you were saying that the growth in computation used is just way faster than the growth in how efficiently we can make GPUs.
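The orders-of-magnitude conversion Sevilla does in his head is easy to check. A one-function sketch:

```python
import math

def ooms_per_year(multiplier):
    """Annual growth multiplier -> orders of magnitude (OOMs) per year."""
    return math.log10(multiplier)

print(ooms_per_year(4.0))   # ~0.60 OOM/year for 4x/year compute growth
print(ooms_per_year(1.35))  # ~0.13 OOM/year for 35%/year efficiency gains
print(ooms_per_year(4.0) / ooms_per_year(1.35))  # ~4.6 -- "more like 5x"
```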
Jaime Sevilla: That's right, exactly. So what is missing? Why are these numbers going so high? There are a couple of less important factors here, like people training for longer, which matters to an extent, and people switching to number formats from which they can get more performance, like the recent switch from FP16 to mixed FP8 precision.

Daniel Filan: FP16 being a 16-bit floating point format, so roughly a measure of how many significant digits you use?

Jaime Sevilla: Yes, that's right.

Daniel Filan: So they were using something like 16 bits per number, and now they're using something like eight.

Jaime Sevilla: That's right. But the most important factor overall is just that people are willing to put in so much more money. Of course, this raises a natural question: how do they decide how much money to put in, and why have they decided to scale their spending at a rate that results in this smooth growth? Here I don't really have an authoritative answer, just some guesses. One of them: there was a recent interview one of the frontier lab leaders gave, where he said, you know, when you're scaling models, this is a bit of an art, and you want to be careful that you don't train a model for a very long time only for the training run to go horribly wrong. You really want this learning curve where you progressively learn how to train larger and larger models and test your different hypotheses about which techniques are going to work at scale. So that's one story for why people are choosing to scale so smoothly: they believe they will learn a lot over the process, and won't waste a lot of resources on a training run that might not go anywhere. There's perhaps another explanation we could give here, which is that doing a very large training run is very expensive. Again, GPT-2 in 2019 was $100,000 to train, but five years later it was so much cheaper. So to an extent you want to wait to do your very large training runs, because later you're going to have much better ideas about how to make the most of them: better algorithmic innovations that help you get the most out of the compute you have, and also, to a smaller extent, better hardware, which, as I've said, is not that big of a deal, but it's still a deal.

Daniel Filan: In some sense this is a reverse interest rate, right? Your money is more useful later than it is now.

Jaime Sevilla: Exactly.

Daniel Filan: Huh, that's kind of a weird thing to think about. This feels like the kind of thing somebody would have studied: what to do when interest rates work that way. Maybe I'm thinking about it weirdly.

Jaime Sevilla: Actually, one thing I just thought of, on this reversed interest rate where your money is more useful in the future: one fun observation is that to an extent this limits how long people are going to be willing to train models for. If your training run takes a very long time, then at some point you would have been better off just starting later and doing a shorter training run, with the increases in efficiency that come from starting later. One interesting analogue here: in the '90s there was this project to sequence the human genome. I'm not super familiar with the details, but if I recall correctly, there was a first project that tried to do this using earlier technology, and it went on for many, many years, and it was beaten by a project that started later, because some years later there was better technology that was so much more efficient that they were able to finish sequencing the genome much faster than the project that had started earlier. The situation in AI might be analogous: if your plan is to do a 10-year training run, you're going to be completely outclassed by people who start in the last year and use much better hardware and much better algorithmic insights to train a model that's far better than what the 10-year training run produces.

Daniel Filan: Sure. That actually gets to a question I had: it takes some amount of time for models to train, and you have this graph with little dots, where the x-coordinate of a dot is when the model was trained and the y-coordinate is how much computation was used. You have to pick a date for that x-coordinate, and as far as I can tell, if computation is growing at 4-5x per year, it really matters whether you put the dot at the start of the training run or at the end, if training takes more than one month, which I think it does. How do you decide when a model counts as trained?

Jaime Sevilla: This is an excellent question. Right now, what we do, just pragmatically, is choose the date when the model becomes public, when it's officially released. We pick that because many times you just don't know when people really started the training.

Daniel Filan: I wonder if that means there could be an apparent massive boost to these trends just from a company deciding to announce a model slightly earlier. If a company moves the date at which they announce a model forward or backward by one month, that's going to make a difference to this trend, right?
Jaime Sevilla: It will, yeah, absolutely. So maybe the best way of thinking about it is that one trend is the scale of the models we have access to, and the other is the scale of the models being trained right now, or that companies are internally testing, and you should expect the internal models to be potentially 2x or even 4x larger than the models that exist publicly right now.

Daniel Filan: If we're increasing the amount of computation used to train these models so quickly: there are only so many computers in the world, right? At what point are we just going to run out of computers to train stuff on?

Jaime Sevilla: This is an excellent question. To conduct these training runs, the magical thing you need is GPUs, hardware accelerators that are optimized for these large training runs, which mainly consist of matrix multiplications. Right now there's essentially one seller in the world producing and selling these GPUs, and it relies on the services of one company in the world that manufactures them. This is a very unusual situation in which the supply chain for GPUs is incredibly concentrated. The company designing them is NVIDIA, a US-based company, and the foundry actually producing and packaging the GPUs is TSMC, in Taiwan.

Daniel Filan: The Taiwan Semiconductor Manufacturing Company.

Jaime Sevilla: That's right.

Daniel Filan: OK, so a very small number of companies are actually making these computers.

Jaime Sevilla: That's right, and each of them roughly accounts for 90% of its respective market, in terms of design and in terms of manufacturing. This leads to a situation in which there have historically been some shortages. For example, in the last year there was the release of this new GPU, the H100, and people really wanted H100s, because they're very good for training. What happened is that supply quickly ran out: they couldn't meet the demand, at least immediately, and they had to massively expand manufacturing capacity to meet the growing demand for these H100 GPUs. This naturally raises the question of how many GPUs can be produced at this moment, and how much that can expand in the future. This is something we're actively trying to grapple with at Epoch. Maybe I can give you a bit more insight into what's limiting the increase in capacity. Right now, I would say the three main factors physically limiting increases in GPU production capacity are as follows. First of all, packaging equipment. The process for producing GPUs is: first you create a wafer, which has the necessary semiconductors on it, and then you need to package that together with high-bandwidth memory and the other components that make up a GPU, and solder everything together into the actual physical GPU that you plug into your data centers. The technology for doing that is called chip-on-wafer-on-substrate, or CoWoS, technology, and right now my understanding is that people are really limited by the number of machines that can do this CoWoS packaging. That's essentially why they weren't able to produce as many H100s as they could have sold. Together with that, you also need the high-bandwidth memory units, and this could potentially become a bottleneck, though I'm a bit less informed there. That's what's limiting production right now. In the future, what might limit it is the production of the wafers that carry the semiconductors in the first place, which might be quite tricky to scale up, because producing those wafers requires advanced lithography machines that right now are also produced by a single company in the world, ASML, in Holland. So in the long term, the growth rate of wafer production capacity might determine the growth rate of GPU production. Now, those are the physical factors, the physical reasons why TSMC is not producing more GPUs that it could sell to NVIDIA, so NVIDIA can sell them on to its customers. But there might also be some commercial and social factors at play. For example, my understanding is that TSMC could definitely raise prices: NVIDIA would be willing to pay more and spend more of its margin on TSMC. But if TSMC does that, it's going to drive away some of its other customers, and it might be scared of overcommitting to this AI market, where it's not sure whether this is a temporary fad or something that's going to sustain its business in the long term. That might be a reason why it's currently a bit wary of dedicating lots of resources to producing the chips used for AI training. If they became more bullish on AI, it's plausible that they would invest the resources needed to massively expand capacity, which to an extent they're already doing, but even more than that, so they can keep up with this increase in demand for GPUs.

Daniel Filan: OK. If I'm thinking about this: a while ago I committed this number to memory, roughly 10^31 floating point operations. The way I came up with it was: I took some crappy estimate of how many floating point operations I can buy for $1, the per-dollar cost of floating point operations on really good GPUs, and multiplied it by the gross world product in 2019, just the total amount of goods and services bought and sold. In some sense this is kind of a dumb estimate, because it might cost more if they had to make more machines, and also it's sort of weird: a world where we spent 100% of gross world product on computer chips would look very different. How are people buying food? But that was my weird estimate of roughly how much computation I should expect to see before we run out of computers. How good an estimate is that? Am I off by one order of magnitude, or ten orders of magnitude?

Jaime Sevilla: Let me think about this for a second; it's a good question. Right now, to give you an idea, the number of state-of-the-art GPUs that TSMC is producing is on the order of 1 million per year for the H100 series. Each H100 has a capacity of around 10^15 FLOP per second, and that's for FP16, if I'm not mistaken. How many seconds are there in 100 days? I think that's around 10^7 seconds, if I'm not mistaken.

Daniel Filan: I'll do that math and you can keep going.

Jaime Sevilla: Excellent. So we have GPUs at six orders of magnitude, FLOP per second at 15 orders of magnitude, and seconds at seven orders of magnitude, pending confirmation. If you add all of these together, 7 plus 6 is 13, plus 15 is 28. You end up with a FLOP budget per 100 days of around 10^28 FLOP. So, roughly: if people were magically able to gather all the H100 GPUs being produced right now, put them together in a data center, and use them for training, which, to be clear, would come with lots of complications, they might be able to train a model of up to 10^28 FLOP, which would be around three orders of magnitude greater than GPT-4.

Daniel Filan: Yeah, I have roughly 10^7 seconds in 100 days.
Daniel Filan: So 5 × 10^25 is what we currently have, and 10^28 is what we could do if people spend 100 days training. I guess you could spend a couple of hundred days training, but maybe a thousand days of training... should we expect people to at some point just train for longer?

Jaime Sevilla: Three years of training! This is actually a fun ongoing conversation within Epoch: whether we expect training runs to get longer. I already talked about the incentive to keep them short, which is that all these algorithmic improvements are happening, and hardware also gets better over time, which naturally makes you want to shorten your training runs. As for reasons to lengthen them: one is just raw output. If you train for 10 times longer, you get 10 times as much compute, which is pretty straightforward. Also, if you train for longer, you need fewer GPUs to reach a given amount of compute, and you need less power for those GPUs in the first place.

Daniel Filan: Why? Oh, just fewer joules per second, because you have fewer GPUs, so in any given second you're using fewer joules.

Jaime Sevilla: That's right. So there are these incentives to train for longer, and at this point it's not obviously clear which way the balance tips. What we have seen historically is that there is no clear trend in training run lengths, but there's an overall increase, from people training for around one month seven years ago to around three months now, at least for the training runs where we have information. Training for 100 days seems to be something like a typical training run length. Right now, with very weak epistemic status, I expect training runs to become longer, maybe up to twice as long, and perhaps up to three times as long as they are now. Longer than that starts becoming a harder ask, because of the factors I mentioned, and also because just sustaining a training run for more than a year is technically very challenging.

Daniel Filan: Right, there's some rate of random things going wrong: something crashes, there's a power outage, or whatever.

Jaime Sevilla: That's right.

Daniel Filan: Gotcha. OK, so just to pin down this number of how much computation is available for AI training: you're saying 10^28 FLOP, FLOP being floating point operations, in total, if you want to train for 100 days on one year of TSMC's production of H100 GPUs.

Jaime Sevilla: That's right.

Daniel Filan: Do you have a sense of how that number grows over time? In 2030, is it going to be 10^29, or is it going to be 10^35?

Jaime Sevilla: This is something where I don't yet have very well-developed intuitions; we're looking into it at the moment. My sense is that this could go up by a factor of 10 somewhat easily. That would correspond to... I really don't know these numbers off the top of my head. Do you want me to check them?

Daniel Filan: Yeah, sure, do that.

Jaime Sevilla: OK. My understanding is that right now TSMC is dedicating around 5% of its advanced-node wafer production to making NVIDIA GPUs. I think it's quite plausible that this might increase by an order of magnitude if they decide to prioritize AI, and if they're able to solve the packaging constraints and the high-bandwidth memory constraints I mentioned earlier. So I think it's quite plausible that they would be able to produce up to 10 million state-of-the-art GPUs per year, which, if you were to train on all of it, would allow you to reach scales of up to around 10^29 FLOP. Maybe you increase this a bit, because you might not only use the production from a single year: maybe you stockpile and use the production from previous years as well. Also, there will be advances in GPU performance: we will have better GPUs in the future, we already have the B200 on the horizon, and there will be more after that, I'm sure. So all in all, I think it's reasonable to say that by the end of the decade, if you were to dedicate the whole production of GPUs to a single training run, you could possibly reach up to 10^30 FLOP. That said, it's quite unlikely that you're going to be able to use the whole stock of GPUs on a single training run. First of all, companies are going to fight with each other: different actors are going to want to do their own training runs, which means the resources are going to be naturally split. And some companies will want to use a large part of their GPU resources for inference. For example, just look at Facebook. Facebook bought something like 100,000 H100 GPUs just last year, and they plan to have a fleet equivalent to around 600,000 H100 GPUs by the end of the year, if I'm not mistaken. But they're not using that many resources on their training runs; only a small fraction of those goes to training, and most of it is presumably being used for inference right now. This might continue in the future, depending on how much value labs assign to developing these new models.

Daniel Filan: Yeah. Wasn't there a post by Ege Erdil saying that you should expect companies to use about as much computation on inference as on training?

Jaime Sevilla: That's right. He had this neat theoretical argument. There are ways in which you can train a model for longer and, without altering its performance, make it more efficient at inference. The most straightforward way is to train a smaller model for longer, so that in the end it has the same performance, but since it's a smaller model, it's more efficient to run at inference time. If you think about this from the perspective of a company that wants to reduce the total compute it spends across training and inference, naturally what you want to do is try to make these two equal, because that's what minimizes the total expenditure of compute. There are some caveats here: inference is usually less efficient than training, and the economics of inference and training are not exactly the same. Right now, my understanding is that this is not happening at companies like Meta, but it might well be happening at companies like OpenAI, which use very large amounts of resources for training. It's quite plausible, given what we know about how much inference is going on at OpenAI, that their yearly budget for inference is similar to their yearly budget for training. But at this point, while I think this argument is informative and has this neat compelling force, it's also something I want more evidence on, on whether it's actually going on, before relying on it a lot.
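The equal-split argument attributed here to Ege Erdil can be illustrated with a toy model. This is my sketch, not the original post: Chinchilla-style constants from the Hoffmann et al. paper, a fixed target loss, and a hypothetical lifetime inference volume. For a fixed target loss, a smaller model needs more training data but is cheaper to serve; the total tends to be minimized near the model size where training and inference spending are comparable:

```python
import numpy as np

# Chinchilla-style fit from Hoffmann et al. (illustrative use only).
A, B = 406.4, 410.7
ALPHA, BETA = 0.34, 0.28
E = 1.69
TARGET_LOSS = 2.0     # fixed capability level (hypothetical)
T_INFERENCE = 5e12    # hypothetical total tokens served over model lifetime

def data_needed(n_params):
    """Tokens D such that E + A/N^alpha + B/D^beta hits TARGET_LOSS."""
    budget = TARGET_LOSS - E - A / n_params**ALPHA
    return (B / budget) ** (1 / BETA) if budget > 0 else float("inf")

for n in np.logspace(9.5, 11.5, 5):        # 3e9 .. 3e11 parameters
    d = data_needed(n)
    train = 6 * n * d                      # standard ~6*N*D training FLOP
    serve = 2 * n * T_INFERENCE            # ~2 FLOP per parameter per token
    print(f"N={n:.1e}  train={train:.1e}  serve={serve:.1e}  total={train + serve:.1e}")
```

With these illustrative numbers the total is lowest around N ≈ 1e10, where training (~1.2e23 FLOP) and inference (~1e23 FLOP) roughly match: bigger models pay too much at inference time, while smaller ones need disproportionately more training data to hit the same loss.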
Daniel Filan: Fair enough. So, way earlier, when I was asking you how AI was going, the second thing you mentioned was algorithmic improvements. I think the thing people can read about this is "Algorithmic progress in language models" by Anson Ho et al. It's a blog post, and it's also a paper, if you've got time for that. Basically, this was saying that there's something like a 2x speedup in training models to the same loss every 8 months or so; maybe the error bar was 5 months to 14 months. My understanding is that the way this worked is that you picked some level of loss you wanted to get to, some level of performance, and then you tracked how much computation it would take, over the years, to reach that level of performance, and that's where the number comes from.

Jaime Sevilla: Yes and no. That is the abstraction we're going for, and we're hoping it will be one of the applications and interpretations of the model we built. But in the end, the data you have is very sparse: very few people are doing exactly that kind of experiment at exactly the level of performance you care about. So what we did is a bit more general. We just looked at lots of models that had already been trained, so this is an observational study, and we looked at the performance they achieved, in terms of perplexity, the loss they achieved on the text they were tested on, how well they were able to predict the test text; at their scale, in terms of model size and data; and at the year they were trained in. Essentially we fit a regression: given the model size, the amount of data, and the year it was trained in, predict the performance the model achieves. We fit 14 different equations that combine these terms in different ways, and finally chose the one that intuitively seems to resonate with how we think scale and performance relate to each other, and with the role we think algorithmic improvements play in this.

Daniel Filan: Part of what I'm wondering here is how sensitive the number is to the specification of exactly what question you're asking. Conceptually, if I picked a different loss level, would that change it from every 8 months to every 3 months, or to every two years?

Jaime Sevilla: I'm going to introduce you to the bane of my existence, which is scale dependence. In this model, one big assumption we introduce is that these algorithmic improvements work independently of the scale the models are trained at. We made this arguably unrealistic assumption that if you come up with a new architecture, it's going to give you a fixed level of improvement, no matter whether you're training at a larger scale or a smaller scale. Now, that's arguably not how things work. For example, we currently think the transformer scales much better than recurrent architectures, but that might hold only once you have enough scale for the transformer to kick in and start exhibiting these great scaling properties. Below that scale, you might be better off with a simpler kind of architecture; that's not implausible to me. What this means is that our estimates might not sufficiently account for this difference: whether you should expect improvements to be better at the frontier of compute, or better at small compute budgets. And this matters because there are two reasons to care about algorithmic improvements. One is that they help frontier runs be much more efficient and reach new capabilities. The other is that they help a wider set of people train models with a certain level of capabilities. Depending on which of these two use cases you care about, you're going to care more about innovations that work better at frontier scale, or about innovations that work better at small scales and small compute budgets.

Daniel Filan: Fair enough.

Jaime Sevilla: This is something where I would want a better, more scientific understanding of the extent to which this is the case: to look at the specific techniques that drive this algorithmic improvement and see at which scale they were first discovered, at which scales they apply, and whether the efficiency of a technique changes depending on the scale at which it's applied. I wouldn't be surprised to find that it does, but right now we don't yet have a systematic study showing it.

Daniel Filan: One thing I'm kind of confused about when thinking of algorithmic improvements: people authoritatively tell me there are these things called scaling laws for language models, and these scaling laws say: look, here's a formula; you put in how many parameters your model has and how many tokens you're training it on, and it outputs the loss you should expect to reach on your dataset. I thought that if I knew the number of tokens you were training on and the number of parameters your model had, and if I assumed you were looking at each token just once, only training for one epoch, which I gather is what people seem to do, then I could figure out how much computation you were using to reach a given loss. So, algorithmic improvements: are they things that change the scaling laws, or are they ways of better using computation within the confines of those scaling laws, or something else?

Jaime Sevilla: Scaling laws are defined with respect to a specific training method and a specific architecture. For example, in the famous Chinchilla scaling laws paper by Hoffmann et al. from DeepMind, they study one particular setup: they have a transformer, and they define very precisely how they're going to scale it, how they're going to make it bigger. Lots of prescriptions go into this. For example, when you scale a model, you have many degrees of freedom: whether you add more layers, or whether you make the model wider, with more neurons per layer. These are all considerations that affect the end result, the scaling law you're going to fit. You can think of algorithmic improvements as going beyond the particular confines in which the scaling law was originally studied: introducing different tricks, new attention mechanisms in the architecture, maybe training on a different kind of data that has higher quality and lets you train more efficiently, maybe changing the shape of the model in a different way. These are all ways of escaping the confines of the scaling laws.

Daniel Filan: So is this basically saying that scaling laws... I kind of thought of scaling laws as somehow facts of nature, but it sounds like you're saying they're significantly more contingent than I was thinking. Is that right?

Jaime Sevilla: I think that's right in an important sense, though maybe there's a wider sense in which they're a bit more general. The specific sense: the experiments in these scaling law papers each study a particular setup. You can make some assumptions about how much that generalizes, but it's tricky, and there are by now dozens of scaling law papers that study different setups and arrive at slightly different conclusions.

Daniel Filan: To stick with this scaling laws frame: you've written down a sample scaling law on your piece of paper. Total loss is some irreducible loss, which we'll call E, plus some constant divided by the number of tokens raised to some power, plus some constant divided by the number of parameters raised to some power.

Jaime Sevilla: That's right.

Daniel Filan: And basically the things that determine a scaling law are what you think the irreducible loss is, what you think the exponents for parameters and dataset size are, and what the constant factors at the front are.

Jaime Sevilla: That's right.

Daniel Filan: If I'm thinking about algorithmic improvements, are those mostly changing the exponents, or the constant factors? It seems like this would matter a lot.

Jaime Sevilla: It does matter a lot, and this goes back to scale dependence of efficiency. If you were changing the exponents, the efficiency of the improvements would change with scale. But the way we model it in the paper is essentially as a multiplier to the effective model size and the effective dataset size.

Daniel Filan: OK, so I guess that's not quite changing any of the constants, but somehow you're using something instead of N. OK. I guess that kind of suggests you could create a meta scaling law, where you have this scaling law with these constants, and the constants vary over time or something. Is this meta scaling law easy to write down? I feel like someone could get it on a t-shirt.

Jaime Sevilla: I mean, we have 14 different candidates for it in our paper.
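To make the "effective size multiplier" idea concrete, here is one way to write a Chinchilla-style law with algorithmic progress folded in. This is a sketch of the general form, not the exact equation fitted by Ho et al.:

```latex
% Algorithmic progress as time-dependent multipliers on the effective
% model size and effective dataset size (sketch of the general form):
L(N, D, t) = E + \frac{A}{\bigl(a_N(t)\, N\bigr)^{\alpha}}
               + \frac{B}{\bigl(a_D(t)\, D\bigr)^{\beta}},
\qquad a_N(t) = e^{g_N (t - t_0)}, \quad a_D(t) = e^{g_D (t - t_0)}
```

Read this as: a model trained at time t behaves like a model with a_N(t)·N parameters trained on a_D(t)·D tokens at the reference time t_0. The doubling times of these multipliers are what cash out as headline figures like "the same loss for 2x less compute every ~8 months".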
moment like my personal belief is that it is it actually plays like a large factor and this is being informed by some informal uh conversations uh that I've had with people in different labs and rumors I've heard where like people say like oh we're very limited by comput like even we we don't hire like more researchers because a researcher will just take up like Pres compute resources that our our researchers already using for like trying to come up with like better ways of training the models so it seems that to a degree at least in some uh some Labs uh people have this notion of like our research is compute bound our research is also been uh greatly determined by the access that we have to Computing resources we and this sounds like quite reasonable to me like the main way that we learn things in science is you you run an experiment you see if it works and then you you try it you try try it again in like slightly different variations and particularly like it seems that uh scale like is testing whether a a problem is scales testing whether a technique scales uh is is very very important for making like this uh this this advances so uh that naturally like limits a lot like the the the amount of advances that you can do if you are uh constrained uh by uh by a compute budget and uh this might have like this huge relevance on like how the future plays out because imagine that we we're in a world in which like you're bottom neck by ideas you're bottom neck by having more researchers that can do that like possibly in the future we're going to have ai that's going to be like really good at coming up with ideas and like substituting for like uh for like uh human scientists and like coming up with this hypothesis of what you should train and if you are bottleneck by ideas like this could like quickly lead to like very massive improvements in in algorithmic efficiency but instead if instead you're being botton like by compute then is like well sure you have like all of these uh all all of these uh virtual researchers that are going to be working for you and like coming up with great ideas but they're still going to have to go through like this bottleneck of like we need to test their ideas and see which ones uh which ones work and they might be more efficient and coming up with ideas and still like this could lead to like a substantial increase in algorithmic progress over time but uh this this might be like much more moderate than in the other world so I guess this gets to the question of what the Advent of you know really superhuman AI is going to look like and yeah I think classically people have thought of um you know we're just going to have like tons of ai ai researchers um I mean if the bottleneck is compute like compute doesn't just we we don't like just get it from rocks right like some people are like building machines and figuring out how to like do things more efficiently does that suggest that the singularity is going to be like entirely contained within like the Taiwan semiconductor manufacturing company uh um fun question I mean right now like the the the two parts that go into AI progress is uh you have the hardware manufacturing but then you also have like the the software companies that are like a completely separate entity that are like coming up with this ways of of training them so like by default I guess that we we we will expect like something like that but there will be like this really vested interest on like uh on on the the the both the semi factoring company but also the 
a companies to apply AI to uh to increase their own uh their own uh productivity I think particularly like this this this is very naturally happens within aaps especially because a is very good at coding it's very good at like the things that uh that are useful uh for doing a research I think it's very natural that people will want to see can we use this to uh to improve the productivity of like our very expensive uh employees on Hardware manufacturing it also feels like this natural this natural like multiplier where like if you are able to use AI to like increase the productivity of DMC then sure they're going to be able to like produce much more this is going to lower the prices of compute this going to allow you to train like even even larger and better models that uh help you uh achieve better levels of like generality and and capability so to an extent uh I think that the the the intuitions that I have is that I do expect that uh some of the early use cases for like very dramatic increases in productivity are going to be in a companies and I will not be surprised if semi factoring companies semi semiconductor manufacturing companies are next in line okay and I guess that suggests so a lot of just work being done both in AI in general and a lot of what you're checking is you know scaling laws for like language modeling for these predictive tasks I wonder if that suggests that like actually we should be thinking much more about AI progress in whatever good analog of making computer chips is like I don't even know what a good Benchmark for that is in the AI space um I don't know do you have thoughts about that so uh what you I think what you were pointing at is like oh maybe one one type of task that we really care about is to which degree AI is going to be helpful for like improving um chip design for improving like for participating in the processes that go on within a a SE semiconductor manufacturing company is that what you were pointing that yeah yeah I think that that's uh I think that this is right to an extent uh It's tricky to design good benchmarks that really capture what you what you care about like The Benchmark you really care about is like does it actually improve the productivity which is something you will see uh in in the future once you get the models deployed uh there but it will be interesting to start developing like some uh Toy settings which try to get at the at the core of like oh what will it mean to increase the capacity of this model so like for example I one of my colleagues at deok JS like uh he has been thinking about like oh what kind of mmks could be cool to have and would be ative about what we're thinking about in Ai and one of the things he was considering this is more on the software side but he was considering like oh can we predict like uh can we have a benchmark that's about predicting the results of like an a experiment and this is again this is more on the a company uh side but this is this will act as a this could act as like a compute multiplier right because if you have like 10 ideas but uh if if if you have only compute to test uh 10 ideas uh then uh you you you want to be picky with like which ideas you test and it's better if you have like these powerful intuitions about which ideas might work so to the extent that AI can help you provide with these intuitions and guide your search on which techniques to try like it's going to allow you to it's going to allow you to effectively like uh uh test uh increase the the range of options that 
Sure. Okay, so we've talked a bit about algorithms and a bit about computation. I think a lot of people think of AI with basically a three-factor model: there are algorithms, there's computation, there's data; you pour those three into a big data center and out plops GPT-5. Is this basically a good model, or is there something important we're missing?

I think this is a good model, though I will make a distinction, which is that you really care about which constraints are taut at a given moment, and at this moment I'd say that compute is a taut constraint whereas data is not. Right now, the models we have, for the ones where we know the dataset size, are trained on around 15 trillion tokens of data. And the size of Common Crawl, for example, is roughly 100 trillion tokens of data.

And Common Crawl is roughly the internet, or is it like half the internet?

It's about a fifth of the internet. If you look at the amount of content in indexed public text data, that would be around 500 trillion tokens.

Okay, so Common Crawl is 100 trillion tokens, and people should think of a token as being about 80% of a word on average.

Yeah, roughly. Roughly 100 trillion words of data that you can get from Common Crawl.

That's right. And so you're saying that data is not the taut constraint right now.

It is not, quote-unquote; maybe there's some uncertainty here. Maybe in the future AI companies won't be able to use the publicly indexed data to train their models anymore; there are some complications here. And there are some domains for which you really do want more data: if you really care about accelerating experiments, you probably want data about coding, you want high-quality data about reasoning on AI, and you might really want to expand those kinds of data. But to a first-order approximation, the reason I think we're not seeing larger-scale training is that there aren't enough GPUs. If people had more GPUs, they would find ways of gathering the necessary data. So in that sense, I think compute and algorithms are more important to track at the current margin than data is.

Okay. Well, even though it's less important than the other things, I do want to talk about data, because you had this interesting paper, and I mostly just read the blog post: "Will we run out of data? Limits of LLM scaling based on human-generated data", by Pablo Villalobos and colleagues. My understanding of the topline results is that on the public internet there are roughly 3 x 10^14 tokens you can train on, which is 300 trillion if I can do math, which is unclear. So roughly that many tokens to train on, and you would have a model with roughly 5 x 10^28 floating-point operations used to train on that, and roughly in 2028 we'll just be using all the training data. Is that a roughly correct summary, or is there something important I'm missing?

I think that sounds about right. And it's interesting to compare this with what we were talking about before:
we asked, if you were using all the H100s produced in a year, what's the largest model you could train, and we arrived at about 10^28 FLOP. Right. So now, if you were using this indexed-web data to train, what's the largest model you could train? One approximation you can do here is to think about the Chinchilla scaling laws, which tell you, for a given amount of data, what the largest model you can train is: roughly, you want to use about 20 tokens per parameter, for Chinchilla-"optimal" training, lots of quotes-unquotes here. So let's say you use these 400 trillion tokens of data that are in the indexed web. That would allow you to train a model that has 20 trillion parameters. Then the amount of compute is roughly six times the amount of data times the number of parameters: 4 x 10^14 tokens times 2 x 10^13 parameters is 8 x 10^27, and multiplying by six gives 4.8 x 10^28. So essentially 5 x 10^28.

Which is what we arrived at before, right?

Yep, 5 x 10^28.

Oh man, it's nice to see that in action. I was actually wondering where that number came from. Okay, cool. Nice.

So what you see is that there's enough data on the indexed web to train something that would be about five times greater than what you'd be able to train with all the GPUs in the world.

Yeah. This is kind of interesting to me, because there are a few coincidences here. One is the thing you're just noting: if you had five years of global GPU production, you'd be able to train on all of it, Chinchilla-optimally. Another thing I noticed: I looked back at the training-compute growth paper and asked, okay, what's the biggest model right now, and when do we hit 5 x 10^28 floating-point operations? And roughly, depending on whether it's 4x or 5x per year, and on whether I can multiply, it's somewhere between 2028 and 2030. So somehow there's this strange coincidence where, if you extrapolate compute growth, you get enough compute to train on the whole internet right around the time when we are projected to train on the whole internet. Is that a coincidence, or is that just a necessary consequence of frontier models being trained roughly optimally?

No, I think this is a coincidence. What has been driving the amount of data on the internet is adoption of the internet and user-penetration rates, which have nothing to do with AI and GPUs. So I think this is just a happy coincidence.

Well, the 2028 number was from extrapolating how much data models were being trained on, right? So that does have to do with AI.

Sorry, so the number we derived just now, this 5 x 10^28, is based on the amount of data on the indexed web, which has nothing to do with AI.
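The back-of-envelope they just walked through, spelled out with the episode's own round numbers (the 6 x N x D compute rule and the 20-tokens-per-parameter ratio are the standard Chinchilla rules of thumb they cite):

```python
# Chinchilla-style estimate from the conversation.
# Rules of thumb: ~20 training tokens per parameter, compute ~= 6 * N * D.

tokens = 4e14                  # ~400 trillion tokens of indexed public text
params = tokens / 20           # Chinchilla-optimal size: 2e13 = 20T parameters
flop = 6 * params * tokens     # 6 * 2e13 * 4e14 = 4.8e28, essentially 5e28

print(f"{params:.1e} parameters, {flop:.1e} FLOP")
```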
Right, right. What does have to do with AI is when you hit that amount of compute under the extrapolation. The coincidence is that the training run you can do on this amount of data is so similar to the training run you can do with that amount of GPU compute. Gotcha.

So now I want to ask a bit about the future of AI. A while ago you put out this post called "the direct approach", which was roughly a way of turning loss into how much text an AI can write before you can distinguish it from a human. Roughly, it was a way of saying: if you want an AI that's smart enough to write a tweet just like a human could, that happens at this amount of computation; if you want an AI that can write a scientific manuscript about as well as a human can, that happens at that amount of computation, modulo some fudge factor, which is very interesting to talk about. If I looked at those numbers, it said maybe I needed somewhere between 1 x 10^30 and 3 x 10^35 floating-point operations to get AGI that could be roughly as smart as people. But training on all the publicly available data on the internet was only enough to use 5 x 10^28 floating-point operations. Does that mean that scaling just isn't going to work as a way of getting AGI?

So, I will say two things here. The first is that I'd mostly think of this method as trying to estimate an upper bound, because presumably you don't need to be able to mimic humans perfectly in order to write scientific manuscripts of good quality. It's this kind of unrealistic goal in which your model is so good at mimicry that you cannot tell it apart, but it doesn't have to get to be that good in order to have a transformative effect on the economy or to produce quality manuscripts. It's much harder to write something that perfectly mimics someone than to write something useful.

Fair enough, fair enough.

The other thing I'll say is that I'd caution people against taking these numbers very seriously. I think that right now we just don't have a really good understanding of when you hit different levels of capability, and at which scales. We have this rough notion that as you increase the scale, you get more and more capability and more generality, and if you combine that with certain scaffolding techniques, this might lead to AI that's very useful. But when it comes down to saying "this is going to happen at exactly this amount of FLOP", it's a very rough job. There are maybe some suggestive numbers I'd float around. One of them comes out of this paper: trying to estimate, in this kind of setting, how much compute you'd need to train the model if the scaling laws can be sustained for ten more orders of magnitude, which is itself another big "if".
So, what numbers are suggestive to think about? One thing that's quite interesting: I forget exactly who it was right now, it might have been Kurzweil, but around 20 years ago they made some predictions about when we'd have AI that would essentially pass the Turing test. And they said something like: well, we forecast it at the point where you have enough compute to match the human brain, which happens somewhere in the 2020s. And that happened, which is insane, right? They got that right. It's actually true that we now have models that essentially pass the Turing test, in that they can converse with humans and have a meaningful conversation with them. It's quite insane that just by looking at this biological quantity, the amount of computation going on in the brain, and some wild back-of-envelope calculations, they were able to do that.

So is there an analogous thing we can do to talk about when we'll have AI that's really good, that can do essentially everything humans can do? There was this report by Ajeya Cotra where she looked at a few of these biologically inspired quantities, and I think the one that has some hold on my thinking, on the upper end, is the amount of computation that was used to essentially run evolution and give birth to the human species, which she estimates to be around 10^40 FLOP to rerun evolution. There are lots of caveats going into that; if you account for the possibility that we got lucky with this run, and it could have taken much longer, a more conservative estimate could be even up to 10^42 or 10^43 FLOP to recreate human evolution. And that feels to me like the frontier: if we had that amount of compute, then it's no longer about compute; it's about whether we have the necessary techniques to use it productively to create intelligence de novo.

So this is something that has this kind of hold on my thinking. I don't have a very great idea of the level of compute at which we'll see AI that can participate as a fellow worker in the economy. But it's probably not 10^26 FLOP, because we're pretty much already there, and I don't think this is right on the horizon. And it's probably not 10^40 FLOP; that seems like too much, because if you had that amount of compute you'd be able to rerun evolution, and you can probably do better than evolution at creating intelligence with current techniques. I think it's not crazy to argue that. So then it's somewhere in between, and exactly at which order of magnitude, I don't know. Maybe my distribution looks pretty uniform between 10^26 and 10^36 or so.
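If you take that last sentence literally as a prior that is uniform over orders of magnitude (my reading of "pretty uniform between 10^26 and 10^36", not a calculation from the episode), you can ask how much of it sits within reach of the indexed web's ~5 x 10^28 FLOP:

```python
# Assumed prior: log10(AGI training compute) ~ Uniform(26, 36).
import math

lo, hi = 26.0, 36.0                 # orders of magnitude, per the interview
data_ceiling = math.log10(5e28)     # ~28.7: the "train on the whole web" run

p_within_reach = (data_ceiling - lo) / (hi - lo)
print(f"{p_within_reach:.0%}")      # ~27%: most of the prior lies above it
```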
Say instead of that, I'm uniform just between 10^26 and 10^40 floating-point operations to get AI that's smart enough to do all the science and technology instead of us. Most of that is higher than the 5 x 10^28 that we're going to use training on all the publicly available data on the internet.

That's right.

Does that suggest that scaling language models is not going to be the thing that gets us AGI?

I think people will become creative once data becomes a taut constraint. So again, data right now, I don't think, is the taut constraint; I think compute is. The datasets people train these models on, at least when training was happening publicly, were things like Common Crawl or The Pile, which are datasets that were put together by software engineers essentially in their free time; they were not very large, industry-funded projects. To an extent, I think the paradigm is changing, and OpenAI is investing a lot of resources in getting data, especially for fine-tuning purposes. But overall, for the pre-training, it seems companies have been able to get away with just using the data that already exists and is easily available. Once that ceases to be the case, there's a huge incentive to come up with ways to increase data efficiency, and ways to get more data out of other places, and it's interesting to think about what those places might be.

So one thing we're seeing now is that people are training models that deal with increasingly many modalities. GPT-4o, for example, is quite proficient at parsing images and can also produce images, together with DALL-E; I'm not sure if it's DALL-E or native image generation. Anyway, models right now are increasingly multimodal, and you could use data from other modalities to try to push back this deadline on how much data you have available for training. Now, if you just look at image and video data, I don't think this will be a huge delay: maybe it buys you a couple more years of scaling, maybe it buys you an order of magnitude of compute. Essentially, I think it increases the amount of data you have by a factor of three, and the amount of training you can do increases quadratically with the amount of data you have, so maybe an order of magnitude of scaling, broadly.

So what do you do once you've already trained on all the text data, the images, and the video? What else do you turn to? One interesting thing to think about is the model outputs themselves, and synthetic data. Right now OpenAI, if I recall correctly, is producing on the order of 100 billion tokens per day, which roughly extrapolates to 40 trillion tokens per year. And 40 trillion tokens, you know, is substantial; that's pretty high. If you were to keep that up for ten years, you'd have produced an amount of data as large as the size of today's indexed web. And if that data turns out to be useful for training, then you might be able to use it to continue scaling. But it's not completely clear at this moment whether that's going to be useful.
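Two of those estimates, written out. The inputs are the interview's round numbers, and the quadratic step is just the Chinchilla relation from earlier (N proportional to D, so C = 6 x N x D grows as D squared):

```python
# 1) Synthetic data: ~100 billion output tokens per day, sustained.
per_year = 100e9 * 365      # ~3.7e13, i.e. "roughly 40 trillion tokens/year"
per_decade = per_year * 10  # ~3.7e14: comparable to today's indexed web

# 2) Multimodal data: if other modalities ~triple your tokens, and
#    Chinchilla-optimal compute grows as D**2, then 3x data -> 9x compute,
#    i.e. about one extra order of magnitude of training compute.
compute_multiplier = 3 ** 2

print(f"{per_year:.1e} tokens/yr, {per_decade:.1e} per decade, "
      f"{compute_multiplier}x compute")
```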
There have been some studies showing that if you train models on model outputs, the quality of the model ends up degrading; there's this phenomenon called model collapse. We don't have a really good understanding of that just yet. But again, these are in a sense the early days of dealing with this problem, because it hasn't been that big of an issue yet; once it becomes one, I expect the forces that be to push really hard to figure out how we get past it.

So okay, here's how I'm thinking about this. If it's really the case that we're running out of data to train on in 2028, and it doesn't look like we'll have AI at that scale that can really just take over science-and-technology creation from us... there's this question people ask, which is how many big ideas we need before we get superhuman AI, and if we're going to run out of data to train on before we get there, it seems like that puts a lower bound saying we need at least one big idea before we get to super-smart AI. I'm wondering if this seems right to you, or even a useful way of thinking about things.

To some extent, though I'm not sure how big of an idea it's going to be, because it might just be "use synthetic data", and it works, and it's like: well, good idea, it worked.

So yeah, I guess I want to talk a little more about what you can say about AGI, AI that can take over scientific and technological progress from humans. We don't know exactly what level of loss it's going to be at, but is there something to say about whether it's five years away or fifty? Let's start with that question of timelines.

Yeah. So maybe the naive way you can think about this is what we were saying before: it's probably not going to be at 10^26 FLOP, that seems too little; it's probably not going to be 10^40 FLOP, that seems too much; it's somewhere in there that you get the required level of compute to make AI that can substitute for humans in scientific endeavors a reality. And then you just think about how fast we're going, and how much we expect compute to grow. Right now, my naive picture of how compute will go is: it goes very, very fast for a few years, perhaps until the end of the decade, perhaps a little bit more than that, and then eventually it has to slow down. How much it slows down depends on very complicated factors: on whether, for example, we have already found successful applications of AI that allow you to increase the growth rate of the economy, or whether the field kind of stagnates. It's still growing, still growing, but you're now bounded by how fast the economy grows overall, which right now is about 3% per year, which is nothing compared to 4x per year. Holding all of these things in mind: okay, maybe we go 4x until the end of the decade, and then by the end of the decade we're training something that's around 10^29 FLOP or so.
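That endpoint follows from compounding the trend. Here's the arithmetic with an assumed 2024 baseline; the ~5 x 10^25 figure below is my placeholder for a frontier run, not a number from the episode:

```python
# Compounding "~4x per year" from an assumed 2024 frontier-run scale.
base_year, base_flop = 2024, 5e25   # baseline is an assumption, for illustration
growth = 4.0

for year in range(base_year, 2031):
    flop = base_flop * growth ** (year - base_year)
    print(year, f"{flop:.0e}")      # 2030 -> ~2e29: 10^29-scale runs
```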
And then, if you keep going... I don't know; maybe something reasonable to expect is that somewhere between 2030 and 2050 you might cross ten more orders of magnitude of compute, or something. It quickly becomes very complicated. I think that once you're past 10^36 FLOP per year, you start getting into territory where you might just melt the Earth in the process, just because of the heat produced by doing the training.

That's right. Yeah. I wonder, at this stage... right now, computation available seems to be a pretty good proxy for AI capability, because we can sort of hold fixed this pool of high-quality human data that we're drawing from. But if you're right that we're running out of this data in 2028, it seems like maybe at that point computation is just no longer going to be such a good proxy, and maybe we're just going to have to think way more carefully about algorithmic improvements, or how you can make use of various data. Do you think that's right, and do you think that reduces the value of these compute-forecasting-estimate-style things?

I think that, again, right now my guess would be that data is not going to be the most determinant bottleneck, and this is somewhat driven by this belief that people will make it work. Okay, people will figure out how we can use, for example, synthetic data here. I don't know; one example we have is that recently there was this AlphaGeometry paper, in which they used synthetic data to train an AI system that was able to solve olympiad geometry problems. And generally, especially in scenarios in which there exists a right answer, like math or programming, it seems that one naive strategy you can use to generate more data is: okay, you use the latest generation of models to try to generate solutions to problems, and you train on the solutions that are right.
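A minimal sketch of that generate-and-filter recipe. The generator and verifier below are stand-ins; AlphaGeometry's actual pipeline (synthetic theorem generation plus a symbolic engine) is considerably more involved:

```python
# Generate candidate solutions with the current model, keep only the ones
# a verifier accepts, and (not shown) fine-tune on the kept pairs.

import random

def generate_solution(problem: str) -> str:
    # Stand-in for sampling a candidate solution from the current model.
    return f"candidate-{random.randint(0, 9999)} for {problem}"

def is_correct(problem: str, solution: str) -> bool:
    # Stand-in for a real verifier: a proof checker, unit tests, or a
    # numeric check. Here it just accepts ~25% of candidates at random.
    return random.random() < 0.25

def make_synthetic_dataset(problems: list[str], samples: int = 8):
    kept = []
    for p in problems:
        for _ in range(samples):
            s = generate_solution(p)
            if is_correct(p, s):
                kept.append((p, s))
    return kept  # training pairs that passed verification

data = make_synthetic_dataset([f"problem-{i}" for i in range(100)])
print(len(data), "verified (problem, solution) pairs")
```

One intuition for why this can dodge the model-collapse worry, at least in domains with checkable answers, is that the verifier injects real signal: only outputs that pass an external check make it back into training.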
So next I want to talk a bit about Epoch AI as an organization. Why does Epoch AI exist?

It was born out of my frustration. While I was doing my PhD on artificial intelligence, I was somewhat surprised that no one had yet done a very systematic analysis of the different trends that matter: there was this post on compute from OpenAI in 2018, but very little beyond that, which seemed wild to me given the huge amount of interest that AI was creating, at least around me, and how important I think the technology is going to be for the future. So we started this whole process of systematically tracking the things that go into developing AI models, studying the trends, and trying to get a better evidence-based, quantitative picture of what the future looks like. And that's how Epoch was born.

So when I think of Epoch, I think of it as sort of a mix of AI Impacts and Our World in Data. Do you think this is a fair understanding of what you are?

To some extent, yes, though this might be underselling the amount of in-depth research specifically on AI that we do. I think Our World in Data are curators of data who do not produce a lot of original research themselves; instead, they compile the collective knowledge of humanity, and they are very good and very useful at that. At Epoch, instead, we're creating the datasets ourselves and trying to generate an original body of work that people will be able to use to inform decisions about AI. Regarding AI Impacts, they are also close analogues to what we're doing, in terms of trying to think quantitatively about AI. I think maybe they rely more on surveys, and more on analogies with other technologies, to inform their picture of AI; at Epoch we're being a bit more directed: no, this is about AI, and AI is what we're going to focus on, trying to understand and keep up to date with what's happening in the AI world, the latest knowledge that has been produced, and all these concepts like scaling laws and such. I see Epoch's work as having three work streams: one, we collect this data; two, we analyze it; and three, we put all our research together to paint these quantitative pictures of the future of AI.

So there's this work you put in to have this quantitative picture of the future of AI. One way someone could think about this: the point of information, the point of knowing things, is that it can change decisions someone makes, and the more important the decision you changed, the more important it was to know the thing. Concretely, what decisions might people be making differently based on what Epoch AI puts out?

This is an excellent question, which gets at who our audiences are and who is changing their mind due to our work. I think a really important lever here has been policymaking. In the last two years we have seen this surge of interest from governments around the world in governing these new AI technologies, and in order to decide how to govern, they want an in-depth understanding of which levers drive development and how they can regulate them. Compute is the very clear example here: it's this lever that turns out to be very important for AI development, it's quantifiable, and it's something you can exclude people from, so it's a very natural lever for governing. The way I think Epoch data is being used around the world is to inform these conversations: okay, if we want to create compute-based regulation, how do you decide at which compute levels you impose certain requirements? For example, the executive order on AI from the US imposes certain additional requirements on models trained on over 10^26 FLOP. I don't know exactly how they chose that number, but a big suspicion I have is that they looked at our data and thought: 10^26 is something no model has been trained on yet, but it's close enough that within
a year, possibly, companies will be trying to train models this big. And that's a way in which I could see Epoch data being useful for making these important policy decisions. More generally, right now there are many people trying to thoughtfully plan for AI and the advent of these new technologies, and I want them to be better informed; I want them to make decisions that are based on facts, rather than on vibes about what's happening right now in the field. That seems really important, given that this technology might be the next industrial revolution that we live through.

Sure. So if that's kind of what Epoch AI is for... you've got a bunch of work, but not every possible question has been answered yet. I'm wondering, what are the big open questions you most want to address that don't have answers yet?

Okay, big open questions. One key thing we are constantly thinking about is when we're going to reach different levels of automation, and how fast the transition is going to be from a world with little automation to a world with a lot. That seems to be a very relevant input into many decisions being made today: should we try to plan for this period of very fast automation, and try to prepare for it? Is this something that's going to happen in the next year, or in the next ten years? Whether different policies and plans for this technology are actually feasible depends a lot on when this rapid period of automation begins, and on how long the period itself is going to be. So this is something we think about quite a bit.

A second important part: aside from keeping track of what's driving innovation, we want an in-depth understanding of the factors we talked about before, like whether algorithmic innovation has this important component that's driven by compute, and we talked about why that would be relevant for painting a picture of the future of AI. So we think a lot about those kinds of things, and also, more generally, about the different bottlenecks AI might face in the future. We've talked about data already. Another thing, one I've been thinking about lately, is latency: in order to train a model, you need to do a certain number of serial operations, and to an extent this limits the largest training run that you can do. We've also started thinking more seriously about power, and about how much energy you will need in order to train these large machine learning models. More generally, we want to be able to critically examine all of these reasons why the current trajectory of AI might slow down, and try to incorporate that into our thinking about when we will reach different levels of automation.
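To make the latency point concrete, here is an invented-numbers illustration of the mechanism (Epoch's actual analysis of this constraint is more careful than this): more chips widen each optimizer step, but the steps happen one after another, so the serial chain sets a floor on wall-clock time.

```python
# Serial-steps floor on training time (all inputs hypothetical).
tokens = 4e14          # dataset size for the run
batch_tokens = 6e7     # tokens consumed per optimizer step (assumed)
step_latency_s = 1.0   # serial seconds per step: forward+backward+update

steps = tokens / batch_tokens
years = steps * step_latency_s / (86_400 * 365)
print(f"{steps:.1e} serial steps -> at least {years:.2f} years")
```

Growing the run means more serial steps (or bigger batches, which eventually stop helping), so at some scale a training run no longer fits in a calendar-reasonable window.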
And finally, we want to think about how all this will impact society. Economists have been thinking for a long time about the effects of automation, but I'm somewhat disappointed that so far there has been very little uptake among mainstream economists in trying to think about the consequences of AI, of having, essentially, a way of turning computers into workers. I think there are lots of things that classical models of the economy, like Solow growth theory, have to tell us about what effects AI might have on the world, and very little work in just straightforwardly trying to apply these already well-developed, well-understood models to the case of AI. So this is something I'd be hoping to see more of in the future. We do a fair bit of this within Epoch, but I would love for mainstream economists to also join the boat and try to drive the knowledge forward.

Yeah, it's weird. I also find it weird how little there is in mainstream economics on this. And, I'm not sure I want to name names, especially because I haven't read the relevant piece, but I think there are prominent instances of this that just do not seem very high-quality to me. Actually, related to this: I read your 2023 annual update, and one thing you said was that you would have a report on AI and economic growth at some point during 2024. I'm pretty excited for that. When should I expect to be able to read it?

Absolutely. So, we already put out a report on economic growth last year, where we talked about why AI might lead to explosive growth: what the economically literate arguments would be for why you might see explosive growth from AI, and also what the most plausible objections are that we could find.

Sure.

But that was more of a theoretical exercise, walking through classic models of the economy and these high-level considerations. The next level for us is trying to build this comprehensive integrated assessment model of the future of AI, which tries to tie together what we know about compute and what we know about the scaling laws with these models of the economy and of scientific progress. And the hope here is that in the end we'll have a tool that is really helpful for describing, if not realistic, then at least illustrative pictures of what the future trajectory of AI might look like. Now, when is this going to be out? We have an internal version that works, and I find it very insightful, but it's a very large body of work, a very large model that hasn't been thoroughly vetted, and we want to be careful about putting out things we're not confident in. So I think it's probably going to be at least half a year more before we're ready to share it.

Okay, so there's some possibility that we maybe get it by Christmas?

There's some possibility of that, yeah.

All right. Okay, I love timelines forecasting. A line through many of the things you've said are important open questions is just understanding the impacts of AI, and to me there's this key question: okay, you can train an AI to a certain loss; what does that mean? Epoch has done some work on this, in this direct approach blog post. I'm wondering what people should read, what people should look at, to get
a good sense of what loss means.

Yeah, this is a good question, and I feel I don't have a good answer to it right now. Among the things that have happened internally at Epoch to try to grapple better with this question: last year we put out a report on challenges in assessing automation, which my colleague David Owen put out, where he looked at different work that has been done on trying to assess the impact that different AI technologies have had on tasks that are economically useful, and tried to see if there was a pattern in which tasks are easier to automate. That would be the holy grail for us: having a way of ordering all the tasks that are useful and saying these are more automatable and these are less automatable. Having that kind of notion would be very useful for figuring out how AI automation is going to unfold in the future. Sadly, the conclusion of that paper is that the work so far hasn't been that good: every single piece out there disagrees with every other, and we're just basically very confused about how to think about automatability, and about when we'll reach different levels of capability in an economically useful sense. One thing we've started doing more recently at Epoch is our own benchmarking program, to try to get a better sense of how fast AI progress is happening in different fields, and of: well, if you scale these models, what should I expect? What should a 10^28 FLOP model be able to do? To me this is still a huge open question, where I don't think anyone has a really good answer just yet.

This is the AI X-risk Research Podcast; a bunch of listeners are really concerned about existential risk from AI, and I think probably a lot of people are concerned that if we make really good, really smart AI, it might pose an existential risk. You, Epoch AI in general, are a bunch of really smart researchers trying to understand trends in AI progress. Do you think the outputs of your research are going to tell people how to do a good job of making AI, and if so, should people be worried about that?

Sorry, the question is whether the outputs of our work are going to advance how fast AI is being made?

Yep, that's right.

So I think that to an extent this is true: having a better understanding of AI naturally helps people build better AI. Now, I think a lot of the work we do covers things that are already internally known within companies, and I don't imagine that the work we're doing is being massively critical for what's been happening at that scale. But this is a hard question you need to grapple with: in the end, your work is going to be used in a multitude of ways, you don't have control over that, and you need to be thoughtful about whether you want to take that trade-off.
You say: okay, doing this might make AI go faster, or it might make certain further applications of AI more likely, but the trade-off is that everyone is going to be better informed, and we're going to be better prepared to deal with that situation. It's hard to say. One thing I will say is that it's also hard to give an answer to how fast AI should be going. There's a world in which you want to slow it down and have a lot of time to think carefully about how it's going. But there might also be a world in which you want to advance quickly, up to the point where you have AI that's going to be really helpful for improving the way we align these systems and getting them to do what we intend. Right now, I'm just very confused.

So, one thing I've been thinking a lot about over the last year is risk evaluation in different contexts, trying to think through different risks. I think the risk of loss of control is the more complex one to think about, but for the others, I've actually been pretty surprised by how things have played out so far. Governments seem to be doing mostly sensible things, with some caveats, and there has been a very reasonable response from thoughtful people trying to anticipate what's going to happen with AI, what risks are likely to happen in the next couple of years, and trying to get ready to act if something unexpected happens. So in that sense I've become... well, it seems good for society: in terms of risk management, people seem to be doing what's necessary to manage at least the short-term risks from AI. Long-term risk is just so hard to think about. Some days I wake up thinking that maybe having more time and going slowly is going to be better for society, and other days I think: no, actually, this is a risk we should take, we should go a bit faster; we might even want to get capabilities sufficiently high that we can use them to speed up solving this problem, unlike everything else we deal with.

On this narrow question of whether Epoch is figuring out any stuff that the big labs aren't: one thing that's striking for me here... so, Phil Tetlock is this academic who studies forecasting, and basically studies: hey, what if people actually try to be good at it? He gives a bunch more details, but as far as I can tell, the key criterion is just whether you're actually trying to be good at forecasting, and everything sort of flows out from that. Basically, the result is that if people actually focus on forecasting and get really good at trying, then in forecasting geopolitical questions they can do similarly to, or maybe better than (I forget which), intelligence analysts at intelligence agencies who have access to classified details and such. Does that suggest that Epoch AI actually is in a position to be better
at figuring out AI trends than the labs?

To some extent I think this is true. I will even argue that we have one advantage, which is that we're focused on the forest, whereas companies are focused on the trees. Having details about the trees is very useful, and I would love to have more details about the trees, but if you're the one looking at the big picture and putting everything together, that gives you a unique vantage point that others may not have.

Yeah. Thanks for working in our beautiful backgrounds on our set; I believe you're the first guest to do so. So I guess, kind of wrapping up: are there any questions that I really should have asked but have not?

I'm trying to think how respectably you want this podcast to end... no, I think we have covered the basics, and everything we covered was really good. Thank you so much for the podcast; that was really fun.

Well, before we do close: people have listened to this, and if they're interested in following your research and your work, how should they do that?

Well, first of all, I welcome everyone to go right now to their navigation bar, enter epochai.org, and just interact with our website. We have lots of resources I'm sure you're going to find very useful. You already mentioned, at the beginning of the podcast, our trends dashboard; that's where I'd start, and it has all these really important numbers about how fast AI has been developing over the last decade or so. Then we also have our databases, where you're going to be able to see all the data points that make up the trends we've been studying; it will help ground your intuitions about the massive scale of these models. And of course, all the research we put out is public on our website. Together with every paper, we try to release a companion blog post, which is aimed at a less technical audience and helps summarize the main insights of what we found out through our research. So I recommend that people just spend some time going over the pages; I think if you're interested in AI, you will find it a very rewarding experience. Other than that, Epoch AI Research is on Twitter, and you can follow us. I'm also on Twitter myself, and somewhat active, so if you want to hear more of my blistering hot takes about AI, I welcome you to follow me.

Sure. What's Epoch AI's handle on Twitter, and what's yours?

EpochAIResearch is the handle for Epoch, and mine is Jsevillamol.

Okay, great. Well, yeah, thanks for coming in, and thanks for chatting. This has been really fun.

I can say the same. Thank you so much, Daniel.

This episode was edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Filming occurred at FAR Labs. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): Med 0 · avg -5 · 133 segs

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): Med 0 · avg -4 · 72 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): Med -6 · avg -7 · 120 segs
