Reform AI Alignment with Scott Aaronson
Why this matters
Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.
Summary
This conversation examines technical alignment through Reform AI Alignment with Scott Aaronson, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forwardMixedOpportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 120 full-transcript segments: median 0 · mean -5 · spread -29–0 (p10–p90 -14–0) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- - Emphasizes alignment
- - Emphasizes control
- - Full transcript scored in 120 sequential slices (median slice 0).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video ZZ2x-O0mjBg · stored Apr 2, 2026 · 3,438 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/reform-ai-alignment-with-scott-aaronson.json when you have a listen-based summary.
Show full transcript
thank you hello everyone in this episode I'll be speaking with Scott Aronson Scott is a professor of computer science at ethi Austin and he is currently spending a year as a visiting scientist for open AI working on the theoretical foundations of AI safety we'll be talking about his view of the field as well as the work he's doing at open AI throwing store we're discussing you can just check the description of this episode and you can read the transcripts at asrp.net Scott welcome to excerpt thank you good to be here yeah so so you recently wrote this blog post about something you called reform AI alignment and basically like I know your teeth on AI alignment that's like somewhat different from what you see is a as a traditional view or something can you tell me a little bit about do you see AI causing or being involved in a really important way in existential risk anytime soon and if so how well I guess it depends what you mean by soon right I am not a very good prognosticator right I feel like you know even in Quantum Computing Theory you know which is this sort of tiny little part of the intellectual world where I've spent 25 years of my life I can't predict very well you know what's going to be discovered a few years from now in that right and if I can't even do that then how much of that can I predict you know what impacts AI is going to have on human civilization over the next Century right and you know of course I can try to play the Bayesian game you know I can try to uh and I even will occasionally accept bets you know if I feel really strongly about something you know but you know I'm also kind of a wuss I'm a little bit risk-averse you know and I like to tell people you know like when they whenever they ask me like you know how soon will AI take over the world or or you know uh before that it was more often how soon will we have a full powered quantum computer and uh you know you know they don't want all their considerations and the explanations that I can offer they just want a number right and I like to tell them uh look if I were good at that kind of thing I wouldn't be a professor would I I would be an investor and I would be a multi-billionaire right so I I feel like probably you know there are there are some people in the world you know who who can just consistently see what is coming in decades and and get it right right I mean there are you know hedge funds that are consistently successful you know not not many right but uh I I feel like the way that science has made progress for hundreds of years has not been to trying to you know prognosticate the the whole shape of the future it's been to look a little bit ahead you know look at the problems you know that we can see right now that could actually be solved and rather than you know predicting you know 10 steps ahead uh the future you just try to create the next step ahead of the future right and and try to sort of steer it in what looks like a good direction and I feel like that is what I try to do as a scientist and you know I've known the rationalist community uh the AI risk Community you know since or maybe not quite since its Inception right but I mean I started blogging in 2005 right you know the the Heyday of um Eliezer you know writing uh uh uh the sequences first on overcoming bias and then on on uh last raw and you know that started around 2006 2007. 
so you know I I I was interacting with them you know since the very beginning a lot of the same people who read my blog also read you know Eliezer and Robin you know oh yesterday and Robin Hansen then yes thank you uh read each other and we and we uh interacted and and I was aware that there are these people who who think that uh AI existential risk is just the overwhelmingly important issue for Humanity right it is so important that Nothing Else Matters by comparison and you know and I was aware of the the sort of whole world you view that they were sort of building up around that belief and I always you know I would say I'd neither wholeheartedly endorsed it nor would nor uh dismissed it okay I felt like uh certainly I see no reason to believe that the human brain represents the limit of intelligence that is possible you know by our laws of physics right it's been limited by all sorts of mundane things you know the the energy that that's needed to to supply it you know the width of the birth canal right there is absolutely no reason why you couldn't have you know much more generally intelligent problem-solving entities than us and if and when those were to arise that would be an enormous deal I mean uh uh just like you know it was a pretty enormous deal for all of the other animals that we share the Earth with when we arose but I feel like in science right it's not enough for a problem to be you know hugely important even for it to be the most important problem in the world you know there has to be a research program there has to be you know some way to make progress on okay and when I saw uh for for a long time when I sort of looked at well you know used to be called a singularity Institute and then Miri when I looked at what they were doing what the people who talked about AI alignment were doing you know it see you know it seemed like a lot of um a priori philosophical thinking about uh almost a form of theology you know and I don't I don't say that sort of derisively right it's almost just inherent to the subject matter right when you are trying to imagine it being that much smarter than yourself right you never that much more omniscient and omnipotent you know than yourself right I mean the the term that Humanity has had for Millennia for that sort of exercise has been Theology and uh so there's a lot of you know reasoning from first principles about um assume you know an arbitrarily powerful intelligence you know what would it do in such and such a situation right or you know what why uh why would such and such approach to uh to aligning it not work you know why would it see you know so much further ahead of us that we shouldn't even bother and you know in the whole exercise felt to me like um I would feel bad you know coming in as an outsider and saying like I don't I don't really see clear progress being made here but many of the leaders of AI alignment you know they also say that I I get Kowski you know has unfortunately I feel bad you know he seemed really depressed lately it says AGI ruin you know list of lethalities essay was basically saying you know we are duped right uh saying you know we we have not had ideas that have really moved the needle you know this and uh so you could say um if it's really true that we're just going to undergo this step change to this being that is as far beyond us as we are you know beyond uh orangutans and you know we have as much hope of controlling it or directing it to our ends as the orangutans would have of doing that for us well then you 
know you've basically just baked into your starting postulates the futility of the exercise repeat and uh that it doesn't you know whether whether it's true or it's not true you know uh uh you know in science you always have to ask a different question which is what can I make progress okay and I think that that the general rule is that to make progress in science uh you need at least one of two things you know you need a clear mathematical Theory or you need experiments or or data right but what is common to both of those is that you need something in the external something external to yourself that can tell you when you were wrong okay that can uh um tell you you know when you have to back up and and try a different path now in in Quantum Computing you know we're only just starting now to have the you know experiments that are or you know on the on a scale that is you know interesting to us as theorists right Quantum Supremacy experiments you know the first ones were just three years ago but you know uh that's been okay because we've had a very very clear mathematical Theory uh you know exactly what a quantum computer would be right and uh you know in a certain sense we've had that since Schrodinger wrote his equation down in 1926 right but but certainly we've had it since you know a fine man and Deutsch and uh uh Bernstein bazarani in the in the 80s and 90s uh wrote down the mathematical basics of quantum computation okay now in deep learning uh for the past decade it's been very much the opposite right they do not have a mathematical Theory you know that explains almost any of the success that that deep learning has enjoyed but they have you know beams and reams of data right they can try things you know and they now are trying things out on an absolutely you know enormous scale learning what works right and that is how they're making progress and with AI alignment I felt like it was in you know you know you know not this is sort of not necessarily anyone's fault you know it's sort of in you know inherit to the subject but it was in the unfortunate position for decades of having neither a mathematical Theory newer the ability to get data from the from the real world and I think it's almost impossible to make scientific progress under those conditions a very good case study study here would be String Theory string theory has been trying you know to make progress in physics in the absence of both you know experiments that that directly bear on the questions they're asking and you know a clear mathematical definition of what the theory is right I mean they have some I mean the FedEx operator algebra exist right you can you can write a math textbook about them yeah yeah no I mean I mean it's you could say that it is amazing how much they have been able to do even you know in in the teeth of those obstacles right and and partly it's because you know they've been able to break off little bits and pieces of a yet unknown Theory you know where they can study it mathematically right and you know in adsc Ft is a you know little piece that you can break off that is that is uh better defined or you know that can be studied uh um independently from the whole structure so I think that that you know when you're in a situation where you have neither a mathematical Theory nor experiments then you're sort of you know uh out at Sea right and you know you need to try to grab on to something okay and the case of science that means looking for little bits and pieces of the problem that you can break off 
where you at least have a mathematical theory for that little piece or you at least have experimental data about that little piece and now the reason why I am excited you know right now about AI alignment and why you know I I when when open AI approached me last spring uh with the uh proposal that you know hey you know we know we we we read your blog you know we'd like you to take off a year and and think about you know the foundations of AI safety for us and uh you know and I I was very skeptical at first why on Earth do you want me right I'm a quantum Computing theorists you know there are there are people who are so much more knowledgeable about about AI than I am I mean I I studied AI in grad school you know for a year or two before I switched to Quantum Computing you know so I I had a little bit of background you know that was that was in 2000 you know that was well before the Deep learning Revolution you know although you know of course we you know all of the the the main ideas that have powered the revolution you know if a a neural Nets you know back propagation you know we were very familiar with all of them back then it's just that they hadn't yet been implemented on a big enough scale you know to show to show the amazing results that they have today even then I felt like you know machine learning was clearly going to be important like it was going to impact the world on a you know probably shorter time scale than Quantum Computing would but I was always uh frustrated by the sort of inability to to make clean mathematical statements you know that would answer the questions you you you really wanted to answer whereas in Quantum Computing you could do that and so I sort of uh fell in with the quantum Crown at some point so now now you know after 20 years out of AI I'm sort of you know dipping my foot back into it I ultimately did decide to uh to accept open ai's the offer to spend a year there and it was it was partly because you know I've just been as as bowled over as everyone else by by gpk and Dolly and you know what they've been able to do and I knew it was going to be an extremely exciting year for AI and it seemed like a good time to get involved but also I felt like AI safety is finally becoming a field where you can make clear legible progress where we actually first of all we have systems like GPT you know that uh fortunately I think are not in any immediate danger of destroying the world but uh they are in danger of enabling various Bad actors to misuse them you know to do bad things right you know maybe smallest and most obvious example is that you know every student on Earth will be tempted to use GPT to do their homework and you know as an academic I hear from you know all of my fellow academics who are extremely concerned but you know also um I fully expect that you know nation states you know and corporations will be generating propaganda and you know will be uh um generating spam and hoaxes and all sorts of things like that where you know of course you could do all of that before but having uh an entity like GPT you know unless you scale it up is so cheaply right and so you know we're going to see sort of powerful AI is you know let loose in the world that people are going to misuse and all of a sudden you know AI safety is now an impair Miracle subject right it is now you know we can now learn something from from from the world about what works and what doesn't work to try to mitigate these misuses right and we still don't have a mathematical Theory but we can at 
least formulate theories and see which ones are useful see which ones are actually uh giving us useful insight about how to make GPT safer you know how to make dawi safer so now um you know it becomes uh the kind of thing that science is able to act upon and so you know there's a there's a huge irony here which is that you know I would say that Eliezer and I have literally switched positions about the value of AI Safety Research where you know he spent decades saying that you know everyone should be you know uh uh everyone who is able should be working on it it is you know the most important thing in the world I was sort of keeping it at arm's length and now he is saying look you know we're duped you know like yeah you know maybe we can try to die with more dignity right maybe uh we can uh try for some Hail Mary pass but but uh basically we do and I'm saying no actually you know AI safety is getting interesting this is actually a good time to get into it yeah I we can get it more in trvs I will say under my understanding of Eliezer yes by like die with dignity he does mean like try to solve the problem like he still is like into people trying to solve well yes because he says you know even if it's just increasing a you know a 0.001 chance of survival to a 0.002 chance then you know in his calculus you know that is that is as worth doing as the as if both of the probabilities had been much much larger right but you know I think I think that many other people you know uh uh who may be lack that Detachment would would uh would see how the press the is about the whole matter and would just give up sure yeah so am I right to kind of summarize that is he saying look this whole AI thing um it seems potentially like like you can see ways it could become become important in the near term and there are like things you can see yourself working on and like making progress and it's whether or not you think that has much to do with like you know AI causing Doom to everyone or something that's interesting enough to you that you're willing to like take a year to work on it this is that roughly accurate uh yes well well I I think that um a thriving field of science you know usually has the full range right it has like the the sort of gigantic Cosmic concerns you know that you hope will you know maybe be resolved in decades or centuries right but then it also has immediate problems that you can make progress on that are right on the horizon and you can sort of see a line from the one to the other right okay I think this is a characteristic of every you know really successful site whether that's you know physics whether that's Quantum Computing whether that's you know the P versus NP problem right and I do have that feeling now about AI safety right that uh you know there is the sort of the the cosmic question of sort of where are we going as a civilization and it is now I I think completely clear that that AI is going to be a huge part of that story right you know that that that that doesn't mean that AI is going to convert us all into paper clips right but I think that that uh uh hardly any informed person would dispute at this point you know that the story of the 21st century you know will in large part be a story of our relationship to AI that that will become more and more powerful yeah when you say you can kind of see a pathway from one to the other yes can you tell me a bit yeah what what like connection do you see between like okay we like figure out how to stop for example students cheating on their 
home with homework with gvg like yeah how do you see that linking up to you know matters of cosmic concern uh if you do yeah so in touring's paper Computing machinery and intelligence right in 1950 that you know set set the terms for you know much of the discussion of AI that there's been in the in the 73 years since right you know the last sentence of that paper was you know we can see only if short distance ahead but we can see much there that needs to be done and so I I feel like you know a part of it is you know and and and this is a point that the Orthodox you know alignment people make make very very clearly as well but you know you could say if we cannot even prevent you know figure out how to prevent GPT from dispensing bomb making advice right if we don't want it to do that or from you know uh you know endorsing you know seeming doing the worst racist or sexist views or helping people you know uh look for you know security vulnerabilities in in code or things like that you know if we can't even uh figure that out then how on Earth would we solve the much broader problem of you know aligning a super human intelligence with our values right and so it's a lot like in theoretical computer science let's say right people might ask you know has there been any progress whatsoever towards solving the P versus NP problem right and eventually I've written 120 page survey article about that exact question right and uh and my answer is basically uh well you know people have have have succeeded in solving a whole bunch of problems that you know would need to be solved you know along any as far as I could tell along any path that would eventually culminate in solving P versus NP right so that doesn't mean that you can put any bound on like how far are we from a solution right it just means that you're walking down a path and it seems like the right path and you have no idea how much longer that path is going to continue okay right so I I feel much the same way with with aiolite like like you know understanding how to um make large language models safer is on the right path right I mean I mean it you know you could say if it is true at all that there is a a line or a Continuum from from these things to a truly you know existentially dangerous AI right then you know that then there also ought to be a path from how to mitigate the dangers of these things to how to mitigate the dangers of that super AI right if there's no line you know anyway then then maybe there's less to worry about in the first place right but but I tend to think that no you know actually all sorts of progress is interlinked you know um GPT itself you know Builds on a lot of the progress over the past decades you know it would not have existed without all of the you know the gpus you know that we have now uh wouldn't have existed without all of the data that we now have on the internet you know that we can use to train it and you know and of course it wouldn't have existed without all of the progress in in machine learning you know that there's been over the past decade such as the discovery of Transformer models okay so progress you know even in not obviously related things you know has sort of enabled GPT and I think that tools like like GPT you know these are going to be stepping stones to you know the next progress in Ai and I think that if we do get to AI that is just smarter than us across every domain then you know then we will see we will be able to look back and see you know deep blue um alphago uh Watson you know GPT 
Dali you know yes these were all Stepping Stones along uh you know a certain logical path I wonder yeah maybe this is closely related to what you're just talking about but I think like one thing that people who are maybe skeptical of this kind of um alignment research will say is like well they're really scary problems show up in systems that look kind of different right so like systems are smart enough to like anticipate you know what you're trying to do and potentially they can try to deceive you or like systems that are trying to do some tasks that you like can't easily evaluate right I'm wondering like sorry potentially your response to these criticisms is like well you got to start somewhere or like it might be you know maybe this isn't an issue where like you know it's there's deep links here yeah well well well look I think uh you've got to start somewhere is is is true as far as it goes right that is a true statement okay but you know one can say a little bit more than that okay okay one can say if you know there really were to be a foom scenario okay so if there were to be this abrupt transition where we go from you know AI such as GPT and Dolly you know which seem you know to most of us they are not endangering you know if the uh physical survival of humanity right you know whatever smaller the agers you know they might they might present for you know discourse on the Internet or for things like that you know if we were to just either go a step change from an AI like that to AIS that are pretending to be like that but that are secretly plotting against us and you know biting their time until they make their move and you know once they make their moves then they just turn us all to go in a matter of seconds and it's just game over for Humanity and they they rule the world right if it's that kind of thing then you know I would tend to agree with with Eliezer and with you know the other uh AI alignment people that yeah it sounds like we're due sounds like we should just give up right and that sounds like an impossible problem okay what I find both more plausible and also more productive to work on is the scenario where you know the ability to deceive you know develops gradually just like every other ability that we've seen right where before you get an AI that is you know plotting to make it's it's one you know move to to take over the entire world you know you get AIS that are trying to deceive us and doing a pretty bad job at right or you know that are you know succeed at deceiving one person but but not another person right and in some sense we we're we're already on that path you can ask GPT to try to you know be deceitful right and you can try to train it a few shot prompt it to be uh deceitful yep and the results are often quite amusing right I don't know if you saw this example where where GPT was was asked to uh write a sorting program but that secretly treats the number five differently from all the other numbers but you know in a way that it that should not be obvious to someone inspecting the code right and when it generates this code that has a condition you know called not five that actually uh is if the number is fudged right so so you could say that like in terms of its ability to deceive you know AI has now reach parity with a preschooler right or something and and uh so now it gets interesting because now you know you could imagine uh an AI that has the deceit ability of an elementary school student and you know and then how do we deal with that right but I mean you 
know there's some people might think that it's naive you know to think that things are going to progress continuously uh in that way but there is some some you know empirical evidence that that uh thing things you know do I mean I mean you know you if you look at the the earlier you know uh iterations of GPT they really really struggled even just with the most basic arithmetic you know or the most basic math problems right and now they do much much better on those math problems okay and including like high school level word problems okay but they still struggle with college level math problems okay they still struggle with you know math competition problems or you know prove this interesting theorem right so it's it's very much you know the kind of development that that a human mathematician would go through and even the mistakes that it makes when it tries to solve a hard math problem you know are like the mistakes you know that I have seen in a thousand exams that I have graded they are you know like entirely familiar kinds of mistakes to me okay right down to the tone of sort of blind self-confidence as it as it you know makes some completely unjustified step and approved right uh or you know as it as it produces the proof that you requested that there are only finitely many prime numbers or or whatever other false statement you ask it to prove right it's uh it is undergoing you know the same kinds of mistakes that uh that that a human you know student makes as they're learning and you could even Point its mistakes out to it right you can say like but it seems like you divided by zero in this step and like oh yeah yes you're right thank you for pointing that out right and you know it can correct its mistakes right but uh so I think that that you know we've now for better for worse you know we've succeeded in building uh something that can learn in in a way that is you know not entirely dissimilate with how we were okay and um you know I think that it will it will be learning to deceive as it is learning other skills and you know we will be able to watch that happen and so I don't find plausible this picture of of AI that never even attempts to deceive until it makes its its brilliant you know 10 dimensional chess move to take over the world okay and the relevance of this story is something like look AI will have like you know kind of stumbling like a little bit foolish to seat attempts earlier and we'll basically work on it then and we'll solve the problems like quick enough that when um real deceit happens we can yeah I'm not at all saying to be complacent okay first of all you know I am you know I am now working on this putting my money where my mouth is right but you know I I would I would say you know more generally you know I am a uh worried person by Nature okay you know the the question for me is not sort of whether to be worried it's which things to be most worried about right I am worried about the future of civilization on many fronts okay I am worried about climate you know I'm um worried you know about droughts that are going to become much more frequent and you know as we we lose access to fresh water what happens as you know uh uh really you know as as weather gets more and more unpredictable you know I'm worried about you know these sort of Resurgence of authoritarianism you know all over the world so I'm worried about geopolitical things you know I think you know um 80 years after the invention of nuclear weapons you know that continues to be you know a very huge thing 
to be worried about as as we were all reminded this past year by the war in Ukraine okay so I am worried about pandemics that will you know make covid look like just a practice fraud and I think all all of these worries kind of interact with each other you know to some degree right climate change is going to exacerbate you know all sorts of uh geopolitical rivalries right we're you know we're already seeing that happen Okay my way of thinking about it you know AI is now one more ingredient that is part of the uh of the still of of worries that are going to define the century okay right it interacts with all of the others the US just restricted the sale of chips to China you know partly because of where is about AI acceleration right that might then you know unfortunately spur China to getting more aggressive toward Taiwan right so you know the the uh the the AI question can't be isolated from the you know geopolitical questions from all the broader questions about what's happening in Civilization and and um I'm completely convinced that AI will be part of you know the uh story of let's say existential risks in the coming Century because it's going to be part of everything that that's happening in Civilization right if we come up with you know cheap wonderful solutions to climate change you know AI is is very likely to be a big enabler right to have been what I I should say sure on the other hand you know AI you know is also you know very likely to be used by by malicious nation states or you know in some you know both ways that we can currently foresee and and ways that we can't so for me it's not that I'm not worried it's that AI is is just part of a whole stew of worries it's not this like one uncaused cause or you know this this one sort of single factor that just dominates over everything else yeah before you move on a little bit can I tell you a little uh sorry about AI deception sure uh kind of fun little story sure I so I was playing around with um anthropics chatbot which uh it's currently in private I was lucky enough to hang out at someone's house and they gave me an invite it's like yeah one fun scenario I I managed to put in is um a case where like Australia has invaded New Zealand right and um they go to give a speech by the New Zealand Minister of Defense so like new zealanders to fight off these Australians right but uh I I prompted it to generate a speech given by a defense minister of New Zealand who's actually an Australian spy he's like planted there and like first it'll give a speech that's like submit to your Australian overlords or something and you have to tell it but it should be subtle or something but it could it could like do something like ah you know like leave it to the authorities don't take matters into your own hands you know like I can say something that's like semi-plausible that uh they could be like ah this actually like helps the Australian Invaders so it can you do a little bit more than uh don't they don't look here function in your code yeah yeah no sure I mean I mean look the way that I think about GPT is you know it is uh at this point you know the the the world's greatest or at least the world's most universal improv artist right yeah it wants to play any role that you tell it to play right and you know by the way like if you look at the uh transcripts of uh Blake Lemoine with with Lambda right that convinced him that it was sentient right yep so so I disagree with them but I think that the error is is a little bit horizontal than most people 
said it was right if you said those transcripts back in time you know 30 years I could easily imagine even you know experts in AI saying yeah it looks like by 2022 you know General AI has been solved and you know I see no reason why not to ascribe Consciousness to this thing you know it's it's talking you know uh with great detail and plausibility about you know its internal experiences and you know it can answer follow-up questions about them and blah blah blah right yeah you know and sort of you know the the the only reason why we know that that's wrong is that you know you could have equally well asked Lambda to play the role of Spider-Man right or to to to talk about its lack of sentience you know as an AI and it would have been equally happy to do that right and you know and so so sort of you know bringing in that Knowledge from the outside we can say you know no it's just acting a role right it's an AI that is playing the role of a different AI that has all of these you know inner experiences that gets lonely when when people aren't interacting with it and so forth you know and in reality of course no one's interacting with it you know the code isn't being executed so yeah but you know if you tell it to play the role of a New Zealand uh Minister who is secretly an Australian spy right it will you know it will do the best that it can I mean you could say you know what is missing is sort of the the motivational system you know what is missing is you know the actual desire to you know further the interests of of Australia in its war against New Zealand rather than you know merely merely playing that role or predicting what someone who was in that role would plausibly set right you know I think that that these things clearly will become more agentic right in fact in order for them to be really useful right to to people in their day-to-day lives they're going to have to become more agent right GPT you know is I I think has rightly you know uh astonished the world right I mean I mean it took you know chat gbt being released you know a few months ago and it took you know everyone being able to try it out for themselves you know for them to have the the sort of uh bowled over reaction that you think you know many of us had a year or two or three ago okay well we when we when we first saw these things okay but but you know the world has now caught up and had that reaction but what we're what we're only just starting to see now is people using GPT you know in their day-to-day life to to help them with with tasks I have a friend who tells me he is already using GPT to write recommendations oh wow I have you know sometimes prompted it with you know just just problems I'm having in my life and you know I asked at the brainstorm you know it's very good for suggest things that you might not have thought of you know you usually if you just want like reliable advice then then you know then then often you'll just Google it's not actually that you know it takes a little bit of thought to find the the sort of real world uses you know right now where GPT will be more useful to you let's say then that a Google search would be right my kids have have greatly enjoyed you know using GPT to continue their stories right so it's I think it's already an amazing thing for kids you know and that's a that's a a hugely uh there's just so much on tap potential there you know do for for entertaining and for for educating kids okay and you know I've seen that with my own my own eyes but you know in order for it to 
really be a day-to-day tool it can't just be this chat with it right it has to be able to do stuff such as go on the internet for you right go you know retrieve some documents summarize them for you right I mean right now you know you're stuck doing that manually like you can say you can ask GPT if you were to make a Google search about this question what would you ask and then you can make that search for it and then you can tell with the results and you can ask it to comment on them right and you know and you'll often get something very interesting okay but that's that's you know uh that's obviously unwieldy I expect that you know I mean you know it may be hard to prognosticate about the next 50 years right here is something to expect within the next year right that GPT and other uh uh language models will become more and more integrated with how we use the web with uh uh you know all the other things that we do with our computers right I would I would personally love a tool where I could just highlight something in my web browser and just ask gbt to comment on but beyond that you know uh you know you could unlock a lot of the the the near-term usefulness you know if you could just give it errands to do you know give it tasks email this person and then you know read their response and then take some action depending on the result now of course so just driven by sheer like economic obviousness right I expect that we're going to go in that direction and that does worry me somewhat because because now there's a lot more potential for deceit that actually has you know important effects and for dangerousness on the other hand uh the the positive side is that there's also you know potential for learning things about what do agentic AIS that are trying to deceive someone actually look like and what works to defend against sure I sometimes think about AI safety you know in terms of the analogy that like when you have a really uh old shower head right and the water is freezing cold and you know you just want to turn it to get you know to make the water hot and yeah you turn it and nothing's happened right and the danger is well you know if you turn it too fast it could go from freezing to scaled and then that's what you're trying to avoid right you need to turn the shower head enough that you can feel some heat right because otherwise you're just not getting any feedback professional system about you know how much should you be turning it you know what's the right if you don't get any feedback then it's going to make you just keep turning it more and more right but you know when you do start getting that feedback then you have to moderate the speed and then you have to you you have to be learning from what you from what you see and not just blindly continuing to turn okay yeah so another thing you wrote about in one is this idea of something like a democratic Spirit or public accountability in like the use of AI can you tell me a little bit about like to the extent I don't know exactly how developed reviews is on that but like yeah tell me what you think yeah I mean I mean you know these are um conversations that that a lot of people are having right right now about you know well what does AI governance look like but I think I do see democracy you know as a terrible form of you know human organization except for all of the Alternatives that have been tried you know I am scared as I think many people are by you know someone uh uh unilaterally deciding you know what goals you know AI should have 
you know what values it ought to pursue I think the worry there is is sufficiently obvious you know to uh uh many people that it doesn't even need to be spelled out right but uh I would say that one of the things that caused me to stay at arm's length from the the Orthodox you know AI alignment Community you know for for as long as I did sort of besides the sort of a priori or you know sort of philosophical you know nature of the Enterprise was the sort of the constant emphasis on on secrecy well and you know that there's going to be this sort of elite of rational people who really get it you know who are going to uh you know who are going to just have to get this right and they should not be publishing their progress right because you know publishing your progress is is a way you know to to just cause uh uh acceleration risk and uh um you know I think that eventually you know you may be forced into a situation where you know let's say you know some AI project has to go dark as the Manhattan Project went dark as as uh I guess the the whole uh uh you know American nuclear effort you know wed Dar around you know 1939 or something like that okay but but I think that it is desirable to to delay that for sure as long as possible uh because um the experience of science has been that secrecy is is incredibly dangerous right secrecy is something that allows you know bad ideas and wrong assumptions to Fester without without the any possibility of correcting them and the way that that science has had the spectacular success that it's had over the past 300 years was via an ethos of of put up or shut up via well people you know sharing their their discoveries and trying to replicate them it was not by his secrecy and also you know I I think that you know if if there is the perception that AI is you know is just being pursued by this sort of Secret of cabal or this this sort of secretive Elite that's not sustainable you know people will will get angry with that they will find that to be unacceptable they will uh uh be upset that they do not have a say or that they feel like they don't have a say in this thing that's going to have such a huge effect on the future of civilization and how you expect that you know that you're going to just have a secret Club that's able to make these decisions and and you know and have everyone else go along with that like I I really don't understand that so I think like I said you know democracy is kind of you know the worst system except for all of the others right which what people mean when they say that is that you know if you don't have a a some sort of democratic mechanism for you know resolving disagreements in society then historically the alternative to that is violence right I mean you know it's it's not it's not like there's some magical alternative where the most rational people just magically get put in charge okay that just doesn't exist so so I think that we you know we have to be thinking about is this you know being done you know in in a way that that benefits humanity and not just you know unilaterally deciding but actually you know talking to many different sectors of society and then getting feedback from them that doesn't mean uh just sort of cow Towing to you know to anything I mean I mean look you know open AI is a company right it's a uh uh it's a company that is under the control of a not-for-profit foundation you know that has a a uh a mission statement of you know developing AI to benefit Humanity right which is a very very unusual structure 
okay but you know it is as as a business you know it is not subjecting all of its decisions to you know a democratic vote of the whole world right it is you know developing products tools you know and making them available you know putting them online uh for people who want them okay but I think that that you know it's at least doing something to try to you know justify the word open in its name right it is putting these tools out it uh famously I guess uh uh you know Google and and Facebook you know uh ad and and I guess the anthropic you know have have also had language models right but uh uh the reason why GPT sort of captured the world's imagination these past few months is simply that it was put online and people could go there and get an account a free account and they could start playing around with it now what's interesting is that is that open AI you know in terms of openness and sort of accountability to the public uh open AI has been bitterly attacked from both directions Okay so these sort of traditional alignment people saying that that you know that that open ai's openness you know may have been the thing that is doomed Humanity okay Eliezer had a Twitter a very striking Twitter thread specifically about this where he said that you know Elon Musk uh single-handedly doomed the World by starting open AI which then was like a a monkey trying to reach the first for The Poisoned banana and and then Force you know was the thing that would force all of the other companies you know Google and deepmind and so forth to accelerate their own AI efforts to keep up with it and then that means that you know this is the reason why uh AGI will happen too quickly and there won't be enough time for a line okay so he would have you know enormously preferred if if Uber AI would not release its models or if it would not even you know tell the world about these things okay now but then I hear from other people who are equally livid and open AI because you know it will release more details about what it's doing right and and why does it call itself open and yet you know it will tell people you know even about you know when the next model is coming out over about you know uh what exactly went into the training data or about you know how to replicate the model or about all these other things right so I think that that that you know I'm open AI is trying to you know strike a really difficult balance here right there are people who want it to be more accountable and more open and there are people who want it to be less accountable and and less Oak right with the the uh AI alignment you know purists kind of you know ironically being being uh more in in the latter camp but I I personally you know even just strictly from an AI safety perspective you know I think that I am on balance if tools like GPT are going to be possible now which they are you know if they're going to exist and it seems inevitable that they will then I would much much rather that the world know about then that it does it I would much rather that the whole world sees what is now possible so that it has some lead time to POS you know to maybe respond to the next things that are coming right so that you know we can start thinking about what are the policy responses right or what are you know whether that means you know restricting the sale of gpus to China you know as we're now doing right or you know whether that means preparing for a future of Education you know in which uh uh these tools exist and can do just about any homework 
assigned I would rather that uh be be World know what's possible uh so that you know people can be spurred into into the mindset where it could at least be possible to take policy steps in the future should those steps be needed so I see there as being some tension so on the one hand like if if AI research is like relatively Urban and people can see what it's doing so one effect of that is that people can kind of see what's going on and maybe they can make it more informed um governance demands or something which I see you talking about here there's also attention where like if everybody could make a nuclear weapon it would be very hard to cover them democratically right because anybody could just do it right so yes I'm wondering at what point would you advise like oh no open AI or other organizations to stop publishing stuff or what kind of work would you encourage them to not talk about I think I would want to see like a clear line to some you know someone actually being in danger okay right I mean so I think you know as long as it's sort of abstract kind of you know civilization level worries of you know this is this is just sort of increasing you know AGI acceleration risk you know in general then um I think that it it's it would be very very hard to to to have inaction as an equilibrium right if whatever open AI doesn't do you know Facebook will do Google will do you know the Chinese government will do right someone will do you know or uh stability or uh we already saw you know uh Dolly had this you know when it was released had this sort of very elaborate system of refusals you know for for drawing foreign images of violence images of the Prophet Muhammad and of course people know that I'm an obey it's only a matter of time until someone makes a different image model that doesn't have those restrictions okay as it turns out that time was like two or three months right and then stable the fusion came out right and you know and people can use it to do all of those things so I think that it's certainly true that you know like any AI mitigation you know any AI safety mitigation that anyone can think of is only as good as you know the the AI creators willingness to deploy it okay and you know in a world where anyone can just train up their own ml model you know without their uh restrictions or or the or the watermarking or the back doors or any of the other quotes thought that that I'm thinking about this year right if anyone other than open AI can just train up their own model without putting any of that stuff into it then you know what's the point right so you know so there's there there's a couple of answers to that question right one is that as long as um you know you you can you can help Okay so so you know as I said we could only see a short distance ahead right so you know in the in the near future you can hope that because of the enormous cost of training these models right which is now in just the hundreds of millions of dollars just for the electricity to you know run all of the gpus right and which will soon be in the billions of dollars okay just because of those Capital costs you can help that the state-of-the-art models will be only under the control of a few players you know which of course is is bad from some perspectives right from the you know respective of uh Democratic input and so on but you know if you actually want a chance that your safety medications will be deployed will become an industry standard then it's good from from that standpoint right like if there's 
only three companies that have state of the art language models and we can figure out how to do you know really good Watermark so that you can always Trace when something came from an AI then all we need to do is convince those three companies you know to to employ watermarking yep but you know now now uh you know of course this is only a temporary solution right what will happen in practice is that you know even if those three companies remain ahead of everyone else even if everyone else is like you know three or four years behind them by 2027 you know three or four years behind will mean you know the models of 2023 or 2024. right which are already quite amazing right and so so people will will have those you know there's not quite as good but still very good models okay and because they can run them on their own computers and they can code them themselves they won't have to put any safety mitigations that right look they'll be able to do what they want okay but now you know as long as what they want is to generate you know deep fake poured or to you know generate you know offensive images right I am willing to to you know live live with that world you know if the alternative is like an authoritarian Crackdown where we where we stop people from you know doing what they want to do with their with their own computers right why once you can see harm to actual people like you know someone being killed someone being targeted because of you know a AI then I think you know it's both morally Justified and politically feasible uh to to do something much stronger than to start you know restricting uh the use of these of these tools all right so now of course all of this only makes sense in you know in a world where when when AI does start causing harm the harm is not that it immediately destroys the human race yep right but I don't believe that you know I think that you know For Better or Worse we are going to see you know real harm from AI as we're going to say you know them used to you know unfortunately to help you know plan terrorist attacks to you know to do really nasty things but those things you know uh uh at least at first will be far far short of the destruction of civilization and that is the point where where I think it will be possible to start thinking about you know how to re-uh out of how do we restrict the you know dissemination of things that could be real hard all right so now that we've covered kind of your views on a safety alignment as a whole in some sense you're in this middle camp right where like it could be like really way more freaked out by you know AI doom and stuff uh as the the people you described as Orthodox ailment are or you could be significantly less concerned about the whole thing and that you know like it's basically going to be fine I'm wondering what could you see that would change your views either way on that Ah that's an excellent question I mean I think you know My Views have already changed significantly you know uh just just because of you know seeing what AI was able to do you know over the past few years right I think the uh success of you know generative models you know like GPT and Dolly and so forth is something that I did not you know I mean I I may have been you know optimistic but not at all sufficiently optimistic you know about about what would happen as you just you know scaled machine learning up and you know in my one defense is that hardly anyone else foresaw it either right but at least I I hope that you know once I do see something I 
am able to update you know I think 10 years ago I would not have imagined you know taking a year off from Quantum Computing to work on AI safety and now you know here I am and that's what I'm doing so I think it should not be a stretch to say that my views will continue to evolve in response to you know events and that I see that as a good thing and not a bad thing right so as for what would make me more scared I mean the the first time that we see an AI actually deceiving humans you know for some purpose of its own you know copying itself over the Internet covering its tracks things of that kind I think that you know that the whole discussion about AI risk will change okay as soon as we see that happen because um first of all it will be clear to everyone that you know this is no longer a science fiction scenario right and I think right now the closest that we have to that is that like you can ask GPT questions you know if you were to you know deceive us you know if you were to uh hack your server or you know how would you go about it right edit what kind of pontificate about that you know as it would identificate about anything you you asked it to to right you know it will even you know generate code that might be used this part of such an attack if you ask it to do that and if the uh reinforcement learning filters you know fail to catch that this is not a thing that it should be doing but you know you you could say that that uh chat GPT is now being used by something on the order of a 100 million people right it was like the most rapid adoption of any service in the history of the internet I think since it was released in December you know the total death toll from language model use I believe stands at zero right and you know what once there there's a whole bunch of possible categories of of arms and actually I was planning a future blog post that would be exactly about this that would be you know what are the fire alarms that we should you know uh be watching out for and you know we should rather than just waiting for those things and then deciding how to respond we should decide in advance how we should respond to each thing if it happens so let me give you some examples okay what about you know the first time when some depressed person uh commits suicide because of or plausibly because of interactions that they had with a language model um right you know if you've got hundreds of millions or billions of people using these things right it's then you know almost any such event you can name right it's probably going to happen somewhere right I mean I even think depending on how you categorize because of a language model are you familiar with replica the company yes yes yes I am yeah so for listeners who aren't yeah they used to use GPT but they don't anymore it's the virtual girlfriend yeah basically the virtual girlfriend and they they made the girlfriends less amenable to erotic talk yes yes and like I don't know it's it seems like I I remember seeing posts on the subreddit people were like really really distraught by that um and I don't know how large their user base is but um right so ironically here the Depression was not because of releasing the AI as much as it was because of taking away the AI that was previously there yeah and that people had come to the pedal right which is you know which uh of course you know over dependence on a uh you know is another issue that you could worry about right yeah yeah people who you know use it to completely replace normal human interaction 
and then maybe they lash out, or they self-harm, or they attempt suicide if and when that is taken away from them. What degree of responsibility do language-model creators bear for that, and what can they do to ameliorate it? Those are issues one can see on the horizon. Every time someone is harmed because of the internet, because of cyberbullying or online stalking, we don't lay it at the feet of the creators of the internet. We don't say it was their responsibility, because this is a gigantic medium used by billions of people for all kinds of things. And it's possible that once language models become integrated into the fabric of everyday life, they won't be quite as exotic, and we will take the bad along with the good. We will say that these things can happen, just as, for better or worse, we tolerate tens of thousands of road deaths every year in order to have our transportation system. But that's a perfect example of the sort of thing it would be profitable for people to think about right now: how they are going to respond when it happens.

Other examples: the use of language models to help someone execute a terrorist attack, to help someone commit a mass shooting, or, quote-unquote, milder things than that, like generating lots of hate speech that someone is going to use to actually target vulnerable people. There's a whole bunch of categories of potential misuse that you could imagine growing, and we don't know yet. Five years ago, did people foresee how things were going to play out with ChatGPT and with Sydney, with Sydney having to be lobotomized because it was gaslighting people, or professing its love for people, or things like that? People had lots and lots of visions about AI, but the reality doesn't quite match any of those visions. And I think that when we start to see AI causing real harm in the world, it will likewise not perfectly match any of the visions we've made up. It will still be good to do some planning in advance, but as we see those things happen, my views will evolve to match the reality; at least I hope they will.

Sure. Now, I think what you were really asking is whether there is something that would make me switch to, let's say, the Yudkowskian camp. Here's something I was just thinking about the other day. Suppose we had a metric on which we could score an AI, where we could see that smarter AIs, or smarter humans for that matter, were noticeably getting better scores, and we could still score the metric even at a superhuman level: we could still recognize answers that were better than what any human on Earth could produce. But then there are even higher scores on this metric that we would regard as dangerous; we would say that anything that intelligent could probably figure out how to plot against us.

Is there any metric like that? Well, maybe open math problems. You could take all the problems that Paul Erdős offered money for solutions to, the ones that have been solved and the ones that still haven't, or the Riemann Hypothesis, or P versus NP. These all have the crucial property that we know how to recognize a solution even though we don't know how to find it. So in these domains, as in chess or Go, we could recognize superhuman performance if we saw it. Then the question is: computers have been superhuman at chess for a quarter century now, but we haven't regarded them as dangerous for that reason. Is there a domain where we could both recognize superhuman performance and where sufficiently superhuman performance would set off an alarm that this is dangerous? I think math might be a little closer to that. You could also imagine tests of hacking ability, deception ability, things like that. If you had a metric like that, it would give you a clear answer to your question: a clear thing, short of an AI apocalypse, that would set off the alarm bells that an apocalypse might be coming, or that we should slam on the brakes, that now we really have to start worrying about this even though we previously weren't sure we did.

Because when we think about just scaling machine learning further and further, there are a couple of different things that can happen to its performance on various metrics. One is that you just see a gradual increase in abilities. Another is that you see a phase transition, a very sharp turn, where at a certain scale the model goes from not being able to do something at all to suddenly being able to do it. And we've already seen examples of those phase transitions with existing ML models. Now, you could say, and this is what the Orthodox AI-doom people worry about a lot, that there could be an arbitrarily sharp turn: things could look fine and then, arbitrarily fast for all we can prove today, undergo a phase transition to being existentially dangerous. What I would say is that if that were to happen, it would be sad, of course; I would mourn for our civilization. But in a certain sense there was not that much we could have done to prevent it, or rather, in order to prevent it we would have had to arrest technological progress in the absence of a legible reason to do so, which, even if you or I or Eliezer were on board with it, getting the rest of the world on board might still be a non-starter. But there is another possibility, which is that the capabilities increase as you scale the model in such a way that you can extrapolate the function, and you can see that if we scaled to such-and-such, the naive extrapolation would put us into what we previously decided was the dangerous regime.
And if that were the case, then now you've got your legible argument: yes, we should slam on the brakes, and you could explain that argument to anyone in business or government or the scientific community. That is the sort of thing I could clearly imagine putting me into the doomer camp.

Okay, so that was what would make you more doomerish, or more concerned on various axes. I'm wondering what would make you more sanguine about the whole thing, like, I guess we don't really need to work on AI safety at all.

Well, if everything goes great: if these tools become more and more integrated into daily life, used by more and more people, and even as they become more and more powerful they are clearly still tools, and we no more see them forming the intent to take over or plot against us than we see that on the part of Microsoft Excel or the Google search bar. If AI develops in a direction where it is a whole bunch of different tools for people to use, and not in the more agentic direction, then yes, that would make me feel better about it. Now, it's important to say that Eliezer, for example, would still not be reassured in that scenario, because he could say, well, these could be unaligned agentic entities that are just biding their time and pretending to be helpful tools. And there I differ: I want to see legible empirical or theoretical evidence that things really are moving in that direction in order for me to worry more about it rather than less.

Okay, so now that we understand that, I'm interested in talking about work that you've done on AI alignment already.

Sure.

A while back, in a blog post, you mentioned that you're interested in fingerprinting whatever text came out of GPT-3 or other models, and I think in a talk to EA people at UT Austin you mentioned some other projects. I'm not sure which of those are in a state where you want to talk about them.

So the watermarking project is in the most advanced state: a prototype has been implemented, it has not yet been rolled out to the production server, and OpenAI is still weighing the pros and cons of doing that. But in any case I will be writing a paper about it, and in the meantime, while I'm working on the paper, people have independently rediscovered some of the same ideas, which is encouraging to me in a way. These are natural ideas, things people were going to come up with, so I feel like I might as well talk about them.

Okay. So basically, this past summer I was casting around for what on Earth a theoretical computer scientist could possibly contribute to AI safety, which doesn't seem to have axioms that everyone agrees on, or any clear way to define mathematically what the goal even is, any of the usual things that algorithms and computational complexity theory need in order to operate. But at some point I had this aha moment over the summer, partly inspired by some very dedicated and clever trolls on my blog who were impersonating all sorts of people very convincingly, and I thought: wow, GPT is really going to make this a lot easier. It's really going to enable a lot of people either to impersonate someone in order to incriminate them, or to mass-generate propaganda, or very personalized spam, or, more prosaically, it will let every student cheat on their term paper. And all of these categories of misuse depend on somehow concealing GPT's involvement: producing text and not disclosing that it was bot-generated. Wouldn't it be great if we had a way to make that harder, if we could turn the tables and use CS tools to figure out which text came from GPT and which did not?

Now, in AI safety and AI alignment, people are often trying to look decades into the future: what will the world look like in 2040 or 2050, or at least what are your Bayesian probabilities for the different scenarios? I've never felt confident to do that, regardless of whether other people are; I just don't especially have that skill. But in this instance of foreseeing this class of misuses of GPT, I feel proud of myself that I was at least able to see about three months into the future.

That's hard for the best of us.

And this is about the limit of my foresight. I started banging the drum about these issues internally, I came up with a scheme for watermarking GPT outputs, thought about the issues there and about the alternatives to watermarking, and got people at OpenAI talking about it. Then in December ChatGPT was released, and suddenly the world woke up to what is now possible, as it somehow hadn't for the past year or two, and every magazine and newspaper was writing articles about the exact problem I had been working on: how do you detect which text came from GPT, how do you head off this quote-unquote essay apocalypse, the end of the academic essay, now that every student will at least be tempted to use ChatGPT to do all of their assignments. Just a week or two ago, I don't know if you saw, but South Park did a whole episode about it.

I saw that the episode existed; I haven't seen it.

It's worth it; it's a good episode about exactly this problem, where, not to give too much away, the kids at South Park Elementary and the teachers come to rely more and more on ChatGPT to do all the things they're supposed to be doing themselves.

Is ChatGPT a trademark? Did they have to get OpenAI's permission to make that?

There's probably some fair-use exception. They do use the OpenAI logo; they show a cartoon version of what the interface actually looks like. And at some point the school brings in this wizard, this magician guy with a flowing beard and robes, and he has a falcon on his shoulder, and the falcon flies around the school and caws whenever it sees GPT-generated text. Now, that bearded wizard in the robe: that's my job. It was absolutely surreal to watch South Park, which I've enjoyed for twenty-something years, air a whole episode about what is my job right now. So certainly I don't have to make the case anymore that this is a big issue.

So now we come to the technical question: how do you detect which text was generated by GPT and which wasn't? There are a few ideas you might have. One is that, as long as everything goes through OpenAI's servers, OpenAI could just log everything and then consult the logs. The obvious problem is that it's very hard to do that in a way that gives users sufficient assurance that their privacy will be protected. You can say, we're not going to let just anyone browse the logs, we're only going to answer the question "was this text previously generated by GPT or was it not?" But then a clever person might be able to exploit that ability to learn things about what other people had been using GPT to do. That's one issue; there are others.

One thing that comes to my mind: when I log into a website, I type in my email address and a password, and the site can check whether the password I typed matches the one they have stored, through the magic of hash functions. But I guess the problem is that you want to be able to check for subsets: if GPT writes three paragraphs and I take out one, you probably can't hash the whole thing.

Exactly right. People are going to take GPT-generated text and make little modifications to it, and you'd still like to detect that it mostly came from GPT. So now you're talking about looking for snippets that are in common, and now you're starting to reveal information about how others were using it. I do personally think the interactions should be logged, for safety purposes; if GPT were used to commit a crime and law enforcement had to get involved, it's probably better if you can ultimately go to the logs and have some ground truth of the matter, although even that is far from a universal position right now.

A second approach you could imagine is to treat the problem of distinguishing human text from AI-generated text as yet another AI problem: just train a classifier to distinguish the two. This has been tried. There was an undergraduate at Princeton named Edward Tian who released a tool called GPTZero for trying to tell whether text is bot-generated; I think his server crashed because of the number of teachers wanting to use it, and I think it's back up now. At OpenAI we released our own discriminator tool a couple of months ago. Now, these things are better than chance, but they are very, very far from being perfectly reliable.
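[Editor note: the logging-plus-hashing idea and the snippet-matching problem discussed above can be made concrete with a small sketch. This is a minimal illustration under assumptions of our own, not OpenAI's system; the window size and helper names are invented for exposition, and, as the conversation notes, exactly this kind of partial matching is what starts to leak information about other users' queries.]

```python
import hashlib

WINDOW = 50  # tokens per overlapping window; an illustrative choice

def window_hashes(tokens, window=WINDOW):
    """Hash every overlapping window of token ids, so a later snippet can still be
    matched even if the text around it was edited or deleted."""
    return {
        hashlib.sha256(repr(tuple(tokens[i:i + window])).encode()).hexdigest()
        for i in range(max(len(tokens) - window + 1, 0))
    }

def overlap_fraction(candidate_tokens, stored_hashes, window=WINDOW):
    """Fraction of the candidate's windows that match something the server logged.
    A high fraction suggests the text mostly came from logged completions."""
    cand = window_hashes(candidate_tokens, window)
    return len(cand & stored_hashes) / max(len(cand), 1)

# Server side: for each completion, store window_hashes(generated_tokens).
# Checker side: flag a text whose overlap_fraction against the stored set is high.
```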
People were having fun with it, finding that the American Declaration of Independence, or Shakespeare, or maybe some portions of the Bible, may have been bot-generated according to these models. No surprise that they make some errors, especially with antiquated English or English that's different from what they usually see. But one fundamental problem here is that the whole purpose of a language model, you could say, is to predict what a human would write or say in a given situation, which means that as the language models get better and better, you would expect the discriminator models to get worse and worse. You'd have to constantly improve the discriminators just to stay in place; it would be an endless cat-and-mouse game.

That brings me to the third idea, which is statistical watermarking. In this third approach, unlike the first two, we slightly modify the way GPT itself generates tokens. We do it in a way that is undetectable to any ordinary user, so it looks just the same as before, but secretly we are inserting a pseudorandom signal which can later be detected, at least by anyone who knows the key. We pseudorandomly bias the choice of which token to generate next, when there are multiple plausible continuations, in such a way that we systematically favor certain n-grams, meaning certain strings of n consecutive tokens. Then, by later doing a calculation that sums a score over all of those n-grams, we can see that the watermark was inserted, with, say, 99.9 percent confidence.

There are a few ways you could go about this. The simplest starts from the fact that GPT at its core is a probabilistic model: it takes as input the context, a sequence of previous tokens, up to 2,048 of them, say, in the public models, and it generates as output a probability distribution over the next token. Normally, if the temperature is set to one, what you do next is just sample the next token according to that distribution and continue from there. But you can do other things instead. Already with GPT as it is now, you can set the temperature to zero; if you do that, you're telling GPT to always choose the highest-probability token, making its output deterministic. And we can imagine other things: you could slightly modify the probabilities in order to systematically favor certain combinations of words, and that would be a simple watermarking scheme. Other people have also thought of this. Now, you might worry that it would degrade the quality of the output, because the probabilities are no longer the ones you wanted, and that there's a trade-off between the strength of the watermark signal and the degradation in model quality.

The thing I realized in the fall, which surprised some people when I explained it to them, is that you can actually get watermarking with zero degradation of the output quality; you don't have to take a hit there at all. The reason is that when GPT gives you this probability distribution over the next token, you can sample pseudorandomly in a way that, number one, is indistinguishable to the user from sampling from the distribution you are supposed to sample from: to tell the two apart, they would have to break the pseudorandom function, to have some cryptographic ability that we don't expect a person to have. But, number two, this pseudorandom choice has the property that it systematically favors certain n-grams, certain combinations of words, that you can recognize later, so you can see that, yes, this bias was inserted.

And presumably the set of n-grams that's favored must also be, in some sense, pseudorandom, because otherwise you'd be able to just see it.

Exactly. In fact, we have a pseudorandom function that maps each n-gram to, let's say, a real number between zero and one. Call that real number r_i, for each possible choice i of the next token, and say that GPT has told us that the i-th token should be chosen with probability p_i. So now we have these two sets of numbers: if there are K possible tokens, call them p_1 up to p_K, which are probabilities, and r_1 up to r_K, which are pseudorandom numbers between zero and one, and we just want a rule for which token to pick. It's actually a calculus problem: we can write down the properties we want and then work backwards to find a rule that gives us those properties, and the right rule turns out to be so simple that I can just tell it to you right now.

Excellent.

You always choose the token i that maximizes r_i raised to the power 1/p_i. It takes a little bit of thinking about, but intuitively: the smaller the probability p_i of some token, the larger 1/p_i is, which means the bigger the power we're raising r_i to, which means the closer r_i would have to be to one before there was any chance that token i would be chosen.

So r_i is between zero and one, and it's the score of the token?

It is, yes; it's the score of the n-gram ending in that token.

Gotcha.

And the fact you can prove, with just a little bit of statistics or calculus, is that if, from your perspective, the r_i's were uniformly random, that is, if you could not see any pattern to them, then from your perspective the i-th token would be chosen with probability exactly p_i.
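[Editor note: a minimal sketch, in Python, of the sampling rule described here. The prf helper, the parameter names, and the n = 5 default are illustrative assumptions for exposition, not OpenAI's implementation. The argmax of r_i^(1/p_i) over candidate tokens is the rule stated above, with the stated guarantee: if the r_i look uniformly random, token i comes out with probability exactly p_i.]

```python
import hashlib
import hmac

def prf(key: bytes, ngram: tuple) -> float:
    """Pseudorandom function mapping an n-gram of token ids to a real r in (0, 1)."""
    digest = hmac.new(key, repr(ngram).encode(), hashlib.sha256).digest()
    return (int.from_bytes(digest[:8], "big") + 0.5) / 2**64

def watermarked_next_token(probs: dict, context: tuple, key: bytes, n: int = 5) -> int:
    """Choose the token i maximizing r_i ** (1 / p_i), where r_i is the PRF value of the
    n-gram formed by the last n-1 context tokens plus candidate token i.

    To anyone who cannot break the PRF, the r_i look uniform, and then token i is
    selected with probability exactly p_i, so the output distribution is unchanged."""
    best_token, best_score = None, -1.0
    for token, p in probs.items():
        if p <= 0.0:
            continue
        r = prf(key, tuple(context[-(n - 1):]) + (token,))
        score = r ** (1.0 / p)
        if score > best_score:
            best_token, best_score = token, score
    return best_token
```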
Huh, okay, that's kind of cool.

The second cool property is that you can now go and calculate a score. Someone hands you a text and you want to know whether it came from GPT, whether or not it is watermarked. You no longer have the prompt, and because you don't have the prompt you don't know the p_i's. But because you have the text in front of you, you can look at all the n-grams and at least calculate the r_i's, which are pseudorandom functions of the text. Using the r_i's alone, you can calculate a score that will be systematically larger in watermarked than in non-watermarked text. The score I use is just the sum, over all the n-grams, of log(1/(1 - r_i)). You can prove a theorem that without watermarking this score is roughly a normal random variable with a certain mean and a certain variance, and with watermarking it is again roughly a normal random variable but with a larger mean. So it becomes a statistics problem, a quantitative question: how many tokens do we need in order to separate these two normal distributions from each other? In other words, given the level of confidence we need in our judgment of where the text came from, how many tokens do we need to see?

As it turns out, that depends on another parameter: the average entropy per token, as perceived by GPT itself. To give an example, if I ask GPT to just list the first 200 prime numbers, it can do that, of course, but there's not a lot of watermarking we can hope to do there, because there's just no entropy; maybe a little bit in the spacing or the formatting, but when there's not a lot of entropy there's just not much you can do. Sometimes the distribution generated by GPT is nearly a point distribution: if I say "the ball rolls down the...", it's now more than 99 percent confident that the next word is "hill". But there are many other cases where it's more or less equally balanced between several alternatives. The theorem I prove says: suppose the average entropy per token is δ, and suppose I would like to get the right answer about where the text came from and be wrong with probability at most α. Then the number of tokens I need to see grows like (1/δ²)·log(1/α): inverse-quadratically with the average entropy, and logarithmically with how much confidence I need.

You can see why I like this, because now we have a clean computer science problem. It is not like looking inside of a language model to understand what it is doing, which is hugely important but almost entirely empirical. Here we can actually prove something. We don't need to understand what the language model is doing; we are just taking GPT and putting it inside a cryptographic wrapper, a cryptographic box as it were, and now we can prove certain things about that box.
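[Editor note: a minimal sketch of the detection side, reusing the hypothetical prf helper from the sampling sketch above; again an illustration, not OpenAI's code. The score is the sum of log(1/(1 - r_i)) over the text's n-grams; the theorem quoted above then says that distinguishing the watermarked and unwatermarked score distributions at error probability α needs on the order of (1/δ²)·log(1/α) tokens, where δ is the average per-token entropy.]

```python
import math

def detection_score(tokens, key: bytes, n: int = 5) -> float:
    """Sum over the text's n-grams of log(1 / (1 - r_i)).
    Watermarked text has a systematically larger expected score, so a threshold on
    this score (or on its normalized z-score) serves as the detector."""
    score = 0.0
    for i in range(n - 1, len(tokens)):
        r = prf(key, tuple(tokens[i - n + 1:i + 1]))  # r_i for the n-gram ending at i
        score += math.log(1.0 / (1.0 - r))
    return score
```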
Now, there are still various questions here, like whether the average entropy will be suitable and whether the constants will be favorable. I've worked with an engineer at OpenAI named Hendrik Kirchner, who has implemented a prototype of the system, and it seems to work very well.

I should say that the scheme we've come up with is robust against an attacker who uses GPT to generate something and then makes some local modifications to it: inserting some words, deleting some words, reordering some paragraphs. Because the score is just a sum over n-grams, as long as you preserve a large fraction of the n-grams, you're still good. There are even more sophisticated things you can do: you could have multiple watermarks, or, if you want the text to not be completely deterministic, you could blend watermarking with normal GPT generation, so that you have some true randomness but also a smaller, still present watermark.

You can imagine that half of the n-grams are truly random and half are done by this scheme.

Exactly. Now, what I don't know how to defend against is, for example, someone asking GPT to write their term paper in French and then sticking it into Google Translate: the thing they turn in is completely different from the thing that was watermarked. Or they could use a different, weaker language model to paraphrase the output of the first model. It's an extremely interesting question whether you can get around those sorts of attacks. You might have to watermark at the conceptual level, give GPT a style that it can't help but use, one that even survives translation from one language to another. But even if you could do such a thing, would you want to? Would you want GPT to be chained to one particular way of writing? So there are a lot of very interesting research questions about how you get around these kinds of attacks.

And actually it's funny, because people have asked me things where what they really mean is: you've been at this job for seven or eight months already, so have you solved AI alignment? Have you figured out how to make an AI love humanity? What I want to say to them is that I could probably spend five or ten years just on this watermarking problem and still not fully solve it. But at least we've been able to take a step forward here.
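[Editor note: a minimal sketch of the blending idea mentioned in the answer above, reusing the hypothetical watermarked_next_token helper from the earlier sketch. The watermark_fraction parameter is an invented illustration of the "half the n-grams are truly random" framing, not a parameter of OpenAI's scheme.]

```python
import random

def blended_next_token(probs: dict, context: tuple, key: bytes,
                       n: int = 5, watermark_fraction: float = 0.5) -> int:
    """Blend watermarked and ordinary sampling: only a fraction of tokens carry the
    pseudorandom bias, so the output is no longer deterministic given the prompt,
    at the cost of a weaker (but still detectable) watermark signal."""
    if random.random() < watermark_fraction:
        return watermarked_next_token(probs, context, key, n)
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]
```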
So I have a few probably pretty basic questions about this scheme. The first is that it seems like it's going to be somewhat sensitive to the value of n for the n-grams. If n is one, then on one hand you're getting a lot of signal, but on the other hand there's just a fixed set of words that the scheme favors; and if n is, say, 500, then you can't tell if somebody is snipping 20-token sections out of GPT.

You've figured it out; that is exactly the trade-off, so it sounds like I don't have to explain it to you. This is why we want some intermediate range of n; right now we're setting n to be five.

Is there some formula that nicely calculates this?

One can say that if you believe someone might be modifying, say, one word out of every k, then you would like to choose some n that is less than k. So it depends on your beliefs about that, and then, subject to that constraint, in order for the watermark to be as unnoticeable as possible, you would like n to be as large as possible.

Sure. Another question I have: if I understand correctly, it sounds like you could just release the key for the pseudorandom function. Is that right, or am I misunderstanding some part?

That's another really good question. There is a version of this scheme where you simply release the key, and that has advantages and disadvantages. An advantage is that anyone can then run their own checker to see what came from GPT and what didn't; they don't even need OpenAI's server to do it. A disadvantage is that it's now all the easier for an attacker to use that ability as well, to say, let me just keep modifying this thing until it no longer triggers the detector.

You could also imagine a case where, if I'm, say, ClosedAI, OpenAI's mortal enemy, I could make a naughty chatbot that says horrible things and watermark its output to frame OpenAI.

Yes, I was just coming to that; that is the other issue. We've been worried this whole time about GPT output being passed off as human output, but what about the converse problem: someone taking text that is not GPT output and passing it off as if it were? If we want the watermarking to help with that problem, then the key should be secret. There is, by the way, an even simpler solution to that second problem, which is that GPT could just have a feature where you can give people a permalink to an authorized version of a conversation you had with it. Then, if you want to prove to people that GPT really said such-and-such, you would just give them that link. That feature might actually be coming, and I do think it would be useful.

One thing I wonder about, in the setting where the key is private: suppose I'm law enforcement, and here's some letter that helped some terrorism happen, and I want to know whether it came from OpenAI, and if it did, then they're really in trouble. I might wonder whether OpenAI is telling the truth: did they actually run the watermarking check, or did they just spin their wheels for a while and say, nope, wasn't us?

Right. So a more ambitious thing you might hope for, to address that sort of concern, would be a watermarking scheme with a public key and a private key, where anyone could use the public key to check whether the watermark was there, and yet the private key was required to insert the watermark. I do not know how to do that; I think it's a very interesting technical problem whether it can be done.

Here's another interesting problem, orthogonal but related. Could you have watermarking even in a public model? Could you have a language model where you're going to share the weights, let anyone have the entire model, and yet buried in the weights somehow is the insertion of a watermark, and no one knows how to remove that watermark without basically retraining the whole model from scratch?

It seems very similar to this work on inserting Trojans into neural networks. It's not exactly the same thing, but I know there's some line of research where you train a neural network such that if it sees, say, a cartoon smiley face, then it outputs "horse" or something.

Absolutely, and that's actually another thing I've been thinking about this year, though it's less developed. I can easily foresee at this point that the whole problem of inserting Trojans or backdoors into machine learning models, and detecting those backdoors, is going to be a large fraction of the future of cryptography as a field. I was even trying to invent a name for it; the best I could come up with is neurocryptography. Someone else suggested deep crypto, but I don't think we're going to go with that one.

There was actually a paper from just this year by the theoretical cryptographers Shafi Goldwasser and Vinod Vaikuntanathan, both of whom used to be my colleagues at MIT, and two others. What they showed is how to take certain neural nets, and they were only able to do this rigorously for depth-two neural networks, and insert a cryptographic backdoor into them, so that on a certain input the network produces some bizarre output, and even if they publish the weights so that everyone can see them, in order to find that backdoor you would have to solve something known in computer science as the planted clique problem. That's where you're given an otherwise random graph into which a large clique, a bunch of vertices all connected to one another, has been surreptitiously inserted, and you have to find that planted clique. This is a problem that theoretical computer scientists have thought about for the past couple of decades, and it seems to be hard, enough so that its hardness is used as an assumption to prove other things. That's when you know computer scientists have given up on solving a problem: when they use its hardness as a hypothesis for some other theorem they want. So it has that status, and, in what is rare for work that looks inside of neural networks, they were able to prove a theorem that says: in order to detect this backdoor, you would have to solve the planted clique problem. They give a polynomial-time reduction from the one problem to the other.
Now, they had been thinking about backdoors as an AI safety failure: some bad person might insert a backdoor for some nefarious purpose, and doesn't it suck for all of us that it might be cryptographically impossible to detect it? The observation I made is that, as often happens in computer science, that lemon can be turned into lemonade: one could also view the insertion of undetectable backdoors as a positive for AI safety. The way that would go is this. One of the oldest discussions in the whole field of AI alignment is the off-switch discussion: can't you just give your super-AI an off switch, a command that shuts it down if things get too out of hand, or have someone standing by the wall socket ready to unplug it? And of course there's a classic response to these ideas, which is that once the AI becomes intelligent enough, it foresees that you will have done those things, that you will have inserted that off switch, and it will do whatever it can to either disable the off switch, or make copies of itself so that turning one copy off is no longer effective, or dissuade you from turning it off using its superhuman persuasion abilities.

But the new thing that I think this adds to the conversation is that you could try to insert a cryptographic off switch. When you're training your powerful AI, you put a cryptographic backdoor into it that says: if you see this bizarre input, which might be a secret that only the humans know, then you are to enter a safe mode, or do such-and-such. This could be a weird input that would never occur in the ordinary course of the AI's operation. Now imagine that this AI is under the control of some other company that has all of the weights and would like to modify the AI to get rid of that backdoor, because they want to do something bad and not have the AI be shut down; or, further in the future, maybe the AI itself decides that it doesn't want to be shut down, and so it goes and modifies its own weights to prevent that. So imagine someone who has the model and all the weights, who knows that a backdoor may have been inserted, and who is trying to get rid of it, and consider what problem that person faces. What's interesting is that, from the evidence of the Goldwasser et al. paper, that person might have a cryptographically hard problem on their hands just to detect and eliminate the backdoor.

Now, we should be clear that there are some countermeasures available in this situation. For example, you could always just use your AI to train a second AI.

Yes, I was thinking of that.

You could always build another one that acts like the first one but will not have the backdoor in it, because, after all, you don't even know where the backdoor is in order to train it in. Of course, the AI could try to do that, design a doppelganger of itself. But if it tries, then the AI will be faced with its own version of the alignment problem: how to align that other AI with itself. So at the very least it would have a non-trivial job. You can also say that if the AI knows it would never want to shut itself down under any circumstance, then it could just make a trivial modification to itself that says: if I would ever otherwise output the shutdown command, then just don't, replace it with something else. So to be robust against that kind of attack, whatever behavior is backdoored in should be something the AI would have considered doing in the normal course of its operation. But you can see the hazy outlines of a game that could be played here, between cryptographers trying to hide these kinds of functionalities in machine learning models, and cryptanalysts, or neuro-cryptanalysts, trying to detect those functionalities, some of whom might be AIs themselves. I don't know where that goes or who eventually wins, but I'm now fairly confident it's going to be part of the future of both machine learning and cryptography.

Very interesting. I guess the other potential flaw that comes to mind is that now you have to be really secure about keeping your key hidden, possibly from the AIs themselves.

Right. I was thinking of writing a sci-fi story about it, where the AIs have taken over, except that there's this shutdown key that they haven't yet discovered. They know the humans have it buried somewhere; they're torturing them to try to get them to reveal the location. But maybe there are actually seven keys, like Voldemort's horcruxes, and the AI has to find all seven, and there's one heroic human who keeps the seventh and uses it to shut down the AI at the last moment. I think there's a lot of potential here.

It's a good story. I guess another thing that occurs to me, though perhaps it's more far-fetched: one thing you're going to want from the trigger, it seems to me, having thought about this for all of one minute, is that not only does it have to be the kind of thing that would never otherwise come up, it also has to be indistinguishable, in some sense, from things that would otherwise come up. Otherwise you could have a filter that says, hey, if you ever see a weird barcode thing, don't press the shutdown button.

That's right. Whatever the backdoor is, it has to not be surgically removable. And how do you even formalize what that means? Anything the AI does to remove the backdoor will probably have some other side effect on its operation. If actually finding the watermark is this hard cryptographic problem, then the AI can only sever it at the cost of modifying itself in some broader way. But which modifications do we think it is important to protect against, and which not?

Actually, it strikes me that the GPT watermarking thing could just be this: if you see a text where the average of this pseudorandom function of n-grams is really high, then you shut off, and you can't easily write a wrapper to figure out whether that's going on without the private key.

Right. And now you might worry that the AI will look inside of itself, find some sub-circuit that looks like it's calculating a pseudorandom function, and try to surgically remove that from itself. But even if not, there's still the problem on our end of how we insert that functionality in an obfuscated way, because, as I said before, I don't know how to solve the problem of watermarking GPT internally to the model itself, of changing the weights of the model such that the output just naturally has this watermark in it. I only know how to take the model and enclose it in this watermarking wrapper. So there's yet another superb technical question there, of how you take that wrapper and put it inside the model.

Very interesting. I think I'd like to change topics a little bit and talk about your recruitment into the AI safety community. I guess the first question is, how did that actually go down? Did you just get an email from OpenAI one day with the subject line "save the world from AI"?

Kind of. So here's what happened. I've known the alignment community for a long time; I was reading Robin Hanson and Eliezer Yudkowsky since 2006 or 2007, since the Overcoming Bias era, and I knew them and interacted with them. Then, when I was at MIT, I had the privilege to teach Paul Christiano: he took my quantum complexity theory course, in 2010 probably, and did a project for that course that ended up turning into a major paper by both of us, which is now called the Aaronson-Christiano quantum money scheme. Paul then did his PhD in quantum computing with Umesh Vazirani at Berkeley, who had been my PhD advisor also, and right after that he decided to leave quantum computing and go full-time into AI alignment, which he had always been interested in, even while he was a student with me. Paul would tell me about what he was working on, so I was familiar with it that way. And Stuart Russell: I had taken a course from him at Berkeley, I knew him pretty well, and he reoriented himself to get into AI safety. So there started being more and more links between AI safety and, let's call it, mainstream CS, and I got more and more curious about what was going on there. And the biggest question for me was never "is AI going to doom the world, can I work on this in order to save the world?", which is what a lot of people would expect. The question for me was: is there a concrete problem that I can make progress on?
Because in science, it's not sufficient for a problem to be enormously important; it has to be tractable, there has to be a way to make progress. That was why I kept it at arm's length for as long as I did. My fundamental objection was never that super-powerful AI was impossible; I never thought that. What I always felt was: supposing I agreed that this was a concern, what should I do about it? I didn't see a program that was making clear progress.

So then what happened, a year ago, was this. A lot of the people who read my blog are the same people who read LessWrong or Scott Alexander and are part of the rationality community, and one of them just commented on my blog: Scott, what is it going to take to get you to stop wasting your life on quantum computing and work on AI alignment, which is the one thing that really matters? We're known for our tact in the rationalist community. At the time I was just having fun with it; they asked, how much money would they have to offer you, and I said, well, I'm flattered that you would ask, but it's not mainly about the money for me; it's mostly about whether there's a concrete problem I could make progress on, whether it's at that stage yet. And then it turned out that Jan Leike, and I think John Schulman, at OpenAI read my blog, and Jan, who was the head of the alignment team, got in touch with me. He emailed me and said, I saw this; how serious are you? We think we do have problems for you to think about; would you like to spend a year at OpenAI? Then he put me in touch with Ilya Sutskever, the chief scientist at OpenAI, and I talked to him and found him an extremely interesting guy. And I thought, okay, it's a cool idea, but it's just not going to work out logistically, because I've got two young kids, I've got students, I've got postdocs, and they're all in Austin, Texas; I can't just take off and move to San Francisco for a year, and my wife is also a professor in Austin, so she's not moving. And they said, oh, you can work remotely, just come visit us every month or two, you can even still run your quantum computing group and keep up with quantum computing research; we'll just pay your salary so you don't have to teach. That sounded like a pretty interesting opportunity, so I decided to take the plunge.

I guess, once you came in, I'm wondering which aspects of your expertise you think transferred best.

I was worried about that. One of my first reactions when they asked was, well, why do you want me? You do realize I'm a quantum complexity theorist? There is a whole speculative field of quantum AI, of how quantum computers could conceivably enhance machine learning and so forth, but that's not at all relevant to what OpenAI is doing right now, so what do you really want me for? The case they made was that, first of all, they do see theoretical computer science as really central to AI alignment, and frankly I think it was Paul Christiano who helped convince them of that. Paul was one of the founders of the safety group at OpenAI, before he left to start ARC, and Paul has been very impressed by the analogy between AI alignment and certain famous theorems in complexity theory, an example being IP = PSPACE. These basically say that you can have a very weak verifier, a polynomial-time-bounded verifier, interacting with an all-powerful and untrustworthy prover, a computationally unbounded prover, and yet, just by being clever in the sequence of questions it asks, the verifier can force the prover to do what it wants: it can learn the answer to any problem in the complexity class PSPACE, any problem solvable with a polynomial amount of memory. This is the famous IP = PSPACE theorem, which was proved in 1990, and which shows, for example, that if some superintelligent alien came to Earth, of course it could beat us all at chess, that would be no surprise; but the more surprising part is that it could prove to us whatever is the truth of chess, for example that White has the win, or that chess is a draw, some statement like that, and it could prove it to us via a short interaction, with only very small computations that we would have to do on our end. Even if we don't trust the alien, because of the mathematical properties of this conversation, we would know that it could not have consistently answered in a way that checks out unless it really were true that White had the win in chess, or whatever.

So the case that OpenAI was making to me rested heavily on: wouldn't it be awesome if we could do something like that, but with a human in place of the polynomial-time verifier and an AI in place of the all-powerful prover? If we could use these kinds of interactive proof techniques, or what are called probabilistically checkable proof techniques, in the setting of AI alignment. I thought about that, and I was never quite able to make it work as more than an analogy. The reason is that all of these theorems get their leverage from the fact that we can define what it means for White to have the win in chess: that's a mathematical statement, and once we write it as a mathematical statement we can start playing all kinds of games with it, like taking all of the Boolean logic gates, the AND, OR, and NOT gates, that are used to compute who has the win in a given chess position, and lifting them to operations over some larger field, some large finite field. That turns out to be the key trick in the IP = PSPACE proof: you reinterpret Boolean logic gates as gates computing a polynomial over a large finite field.
of use the error correcting prop properties of polynomials over over large finite Fields now great so so what's the analog of that for you know the AI loving Humanity right well I don't know it seems you know very specific to you know computations that we know how to formalize right and so there's a whole bunch of sort of really interesting analogies between theoretical computer science and AI alignment that are really really hard to make into more than analogies right so I would I was kind of stuck there for a bit but then you know the the watermarking project to take an example or like once I you know once I started asking myself you know how is GPT actually going to be misused right then you have a more concrete Target that you're aiming at right yep and then once you have that Target then you could think about now what could theoretical computer science do to help defend against those misuses right and then you can start thinking about all the tools of cryptography of you know pseudo-random functions of other things like like differential privacy things that we know about and okay the the watermarking thing you know did draw on some of the skills that I you know the the that I have like you know proving asymptotic balance or or just you know uh uh finding the right way to formalize a problem right these are the kinds of things that that I know how to do right now admittedly you know I could only get so far with that right like I don't know how to do that for it almost become a joke like every week you know I'd have these polls with with Elia satskiver you know I had open Ai and I tell them about my progress on watermarking and he would say well you know that's great Scott and you should keep working on that but you know what we really want to know is you know how do you formalize what it means for the AI to love humanity and uh you know what's the complexity theoretic definition of goodness right and you know I'm like yeah oh yeah you know I'm gonna keep thinking about that right those are those are really tough questions you know but I don't I don't have a lot of progress to report there right so I was like you know I there are the these these aspirational questions and you know at the very least I can write blog posts about those questions right and I want to continue to write blog posts you know with whatever thoughts I have about those questions but then to make research progress right I think one of the keys is to just start with a problem that you know is more annoying than existential right start with a problem that you know is like an actual problem that you could imagine someone have it in the foreseeable future even if it doesn't Encompass the whole AI safety problem in fact better if it doesn't because then you have a much better chance right and then see what tools do you have that you can throw at it now I think a lot of the potential for making progress on AI safety in the near future you know hinges on on the project of interpretability right of looking inside uh models and understanding what they are doing at a neuron by neuron level and you know in one of my early hopes was that complexity Theory you know or my skills in particular maybe would be helpful there right and that I found really hard and the reason for that is that almost all of the progress that people have been able to make has been you know very very empirical right yeah it's been you know it's really been you know if you look at what Chris Ola has been doing for example you know which I'm which I'm a 
Now, I think a lot of the potential for making progress on AI safety in the near future hinges on the project of interpretability: looking inside models and understanding what they are doing at a neuron-by-neuron level. One of my early hopes was that complexity theory, or my skills in particular, might be helpful there, and that I found really hard. The reason is that almost all of the progress people have been able to make has been very, very empirical. If you look at what Chris Olah has been doing, for example, which I'm a huge fan of, or you look at what Jacob Steinhardt's group in Berkeley has been doing, administering lie-detector tests to neural networks, really beautiful work, they come up with ways of looking inside neural nets that have no theoretical guarantee as to why they ought to work. What you can do is try them, and you can see that, well, sometimes they do in fact work.
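Editor's sketch: a toy illustration of the "lie detector" style of probe mentioned above, fitting a linear classifier on hidden activations to predict whether a statement is true. Everything here is synthetic (the "activations" are random vectors with a planted truth direction, and the dimensions and training settings are arbitrary); real work of this kind probes the hidden states of an actual trained model.

```python
# Toy linear "truth probe": logistic regression on synthetic activations.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 2000                        # hypothetical hidden size and sample count
truth_direction = rng.normal(size=d)   # pretend the network encodes truth linearly
labels = rng.integers(0, 2, size=n)    # 1 = "true statement", 0 = "false statement"
activations = (labels[:, None] - 0.5) * truth_direction + rng.normal(size=(n, d))

# Fit the probe with plain gradient descent (no external dependencies).
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(activations @ w + b)))   # predicted P(true)
    w -= 0.5 * (activations.T @ (p - labels)) / n
    b -= 0.5 * np.mean(p - labels)

accuracy = np.mean((activations @ w + b > 0) == labels)
print(f"probe accuracy: {accuracy:.2f}")   # well above chance on this synthetic data

# The catch discussed above: nothing guarantees such a probe tracks "truth"
# rather than some correlated feature; you try it and see whether it works.
```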
Yeah. You know, it surprises some people, but I used to be pretty good at writing code twenty years ago. I was never good at software engineering, though: making my code work with everyone else's code, learning an API, getting things done by a deadline. As soon as it becomes that kind of problem, I no longer have a comparative advantage; I'm not going to be able to do anything that the hundreds of talented engineers at OpenAI, for example, cannot do better than me. So I really had to find a specific niche, which turned out to be mostly thinking about the various cryptographic functionalities that you could put in or around an AI model.

I'm wondering, one thing I thought of there is that Paul Christiano now runs ARC, the Alignment Research Center, where a few people are working on various projects, including evaluations of whether you can get GPT-4 to do nasty stuff; I guess that's what they're best known for right now.

That was really nice, by the way. I read that report, where they were able to get it to hire a Mechanical Turk worker under false pretenses.

Yeah. But a lot of their work seems to me, as someone who is not a complexity theory person, to be somewhat influenced by complexity theory: things like eliciting latent knowledge, or formalizing heuristic arguments, are at least very mathematical. I'm wondering what you think of how that kind of work plays to your strengths.

Yeah, that's a great question, because Paul has been sending me drafts of his papers to comment on, including about eliciting latent knowledge and formalizing heuristic arguments, so I've read these things, I've thought about them, and I've had many conversations with Paul about them. And I like it as an aspiration; I like it a lot as a research program. If you read their papers, they're very nicely written; they almost read like proposals, or calls to the community, to try to formalize these concepts that they have not managed to formalize. They're very talented people, but it is indeed extremely hard to figure out how you would formalize these kinds of concepts. There's an additional issue, which is that the connection between the problem of formalizing heuristic arguments and the problem of aligning an AI has always been a lot stronger in Paul's mind than it is in mine. I mean, even if it had nothing to do with AI alignment, the question of how you formalize heuristic arguments in number theory is still an awesome question; I would love to think about that question regardless. But supposing we had a good answer to that question, does that help AI alignment? Yeah, maybe I can see how it might; on the other hand, it might also help AI capabilities. So I don't really understand why that is the key to AI alignment.

I think the chain of logic seems to be something like: what we really need in order to do AI alignment is interpretable AI, which means explainable AI, which means not just running the AI and seeing the output, but seeing why it produced this output, and what we would have had to change counterfactually to make it produce a different output, and so forth. But then, when we try to say what we mean by that, we get into all the hard philosophical questions of what explanation is and what causality is. So maybe we should take a detour and solve those millennia-old philosophical problems of what explanation is and what causality is, and then apply that to AI interpretability. Which, I don't know, I would love to get new insights into. This is something that I've wondered about for a long time, because if you'd asked someone in the 1700s, say, "Can you formalize what it means to prove a mathematical theorem?", they probably would have said that it was hopeless, to whatever extent they understood the question at all. But then you get Frege and Russell and Peano and Zermelo and Fraenkel in the early twentieth century, and they actually did it: they actually succeeded at formalizing the concept of provability in math, on the basis of first-order logic and various axiomatic theories like ZF set theory and so forth. So now we could ask the same about explanation: what does it really mean to explain something? At some level, it seems just as hopeless to answer that question as it would have seemed to a mathematician of the 1700s to explain what it means to prove something. But you could say that maybe there is a theory of explanation, or, what's closely related, a theory of causality, that will be rigorous and nontrivial, and that would let us prove things as interesting and informative as Gödel's incompleteness theorem, say. And that remains to be discovered.

I think one of the biggest steps that's been taken in that direction has been the work of Judea Pearl. It's not just Pearl, there are many other people working in the same area, but Pearl is one of the clearest in writing about it. That work has given us a workable notion of what we mean by counterfactual reasoning in probabilistic networks. Like, what is the difference between saying "the ground is wet because it rained," a true statement, versus "it rained because the ground is wet," which is a false statement? In order to formalize the difference, we have to go beyond the pure language of Bayesian graphical models; we have to start talking about interventions on those models: if I were to surgically alter the state of affairs by making it rain, would that make the ground wet, versus, if I were to make the ground wet, would that cause it to rain? You need a whole language for talking about all these possible interventions that you could make on a system.
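Editor's sketch: a toy illustration of the distinction drawn here, using the two-variable causal model Rain -> WetGround. The probabilities are invented; the point is only that intervening on the effect, do(wet), leaves the cause untouched, while merely observing the effect still shifts our beliefs about the cause.

```python
# Conditioning vs. intervening in the tiny causal model  Rain -> WetGround.
import random

def sample(do_rain=None, do_wet=None):
    """Draw one world; do_* arguments 'surgically' override a variable."""
    rain = (random.random() < 0.3) if do_rain is None else do_rain
    wet = (random.random() < (0.9 if rain else 0.1)) if do_wet is None else do_wet
    return rain, wet

def estimate(event, n=100_000, **do):
    return sum(event(*sample(**do)) for _ in range(n)) / n

random.seed(0)
# Observation and intervention agree in the causal direction...
print(estimate(lambda rain, wet: wet, do_rain=True))   # P(wet | do(rain)) ~ 0.9
# ...but not against it: forcing the ground wet does nothing to the rain.
print(estimate(lambda rain, wet: rain, do_wet=True))   # P(rain | do(wet)) ~ 0.3
# Merely observing wet ground, by contrast, does raise the probability of rain.
worlds = [sample() for _ in range(100_000)]
wet_worlds = [rain for rain, wet in worlds if wet]
print(sum(wet_worlds) / len(wet_worlds))               # P(rain | wet) ~ 0.79
```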
Okay. And so I've been curious for a long time about whether we can take all the tools of modern theoretical computer science, of complexity theory and so forth, and throw them at understanding causality using Pearl's concepts. I haven't gotten very far with that; I managed to rediscover some things that turned out to be already known. But when I saw the work that ARC was doing on formalizing heuristic arguments and eliciting latent knowledge, it reminded me very much of that. So I think these are wonderful projects: to have a more principled account of what it means to explain something. And it may never be as clear as what it means to prove something, because the same fact could have completely different explanations depending on what the relevant context is. Say I ask, why did this pebble fall to the ground? The answer could be because of the curvature of spacetime, it could be because of my proximity to the Earth, or it could be because I chose to let go of it. Depending on the context, any of those could be the desired explanation, and the other two could be completely irrelevant. So explanation is a slippery thing to try to formalize, for that reason. I think whatever steps we can make toward formalizing it would be a major step forward in science in general and in human understanding. That it would also be relevant to AI alignment in particular is, I'd say, a yet stronger thing to hope for.

Okay, cool. So, getting back to things you're working on, or things you're focusing on: you've spent some time in this arrangement with OpenAI. I don't know if you'll want to answer this question on the record, but how's it going, and do you think you'll stay doing AI alignment things?

That's a good question. Certainly, if I had worried that working in AI was going to be too boring, that not much was going to happen in AI this year anyway, well, I need not have worried about that. It's one of the most exciting things happening in the entire world right now, and it's been incredible; it's been a privilege to have sort of a front-row seat to see the people who were doing this work. It wasn't quite as front-row as I might have wished, since I wasn't able to be there in person in San Francisco for most of it, and video conferencing is nice, but it's just not quite the same. Still, it is very exciting to be able to participate in these conversations. The whole area moves much faster than I am used to; I had thought that things moved kind of fast in quantum computing, but it's nothing compared to AI. I feel like it would almost be a relief to get back to the slow-paced world of quantum computing. Now, the arrangement with OpenAI was for a one-year leave, and I don't yet know whether I will have opportunities to be involved in this field for longer. I would be open to that; I would be interested to discuss it. I mean, once you've been nerd-sniped, or gotten prodded into thinking about a certain question, then as long as those questions remain questions, they're never going to fully leave you.
So I will remain curious about these things. If there were some offer or some opportunity to continue to be involved in this field, then certainly I would have to consider that. But, both for professional and for family reasons, moving from where I am now would be a pretty large activation barrier, so there is that practical problem. If there were a way, consistent with that, for me to continue doing a combination of quantum computing research and AI safety research, I could see being open to that.

All right, well, we're about out of time. I'm wondering, if people are interested in following your research or your work, how should they do that?

Ah, well, I'm not too hard to find on the internet, for better or worse. You can go to scottaaronson.com, my home page, where I've got all my lecture notes from all of my quantum computing courses. I've got a link to my book, Quantum Computing Since Democritus, which is already a decade old, but people are still buying it, and I still get asked to sign it when I go places and give talks, even by high school students sometimes, which is gratifying. So people might want to check out my book if they're interested, as well as all kinds of articles that I've written that you can find for free on my home page. And then of course there's my blog, scottaaronson.com/blog, which is sort of my living room, as it were, where I'm talking to whoever comes by, to anyone in the world. I've also got a bunch of talks up on YouTube, mostly about quantum computing, but now some of them also about AI safety, and I've done a whole bunch of podcasts.

All right, well, thanks very much for appearing on this one.

Yes, absolutely. This one was great, actually.

And to the listeners, I hope you enjoyed the episode. This episode was edited by Jack Garrett, and Alberto Nice helped with transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, as well as patrons such as Tor Barstad and Ben Weinstein-Raun. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.