80,000 Hours Podcast · Technical alignment and control
Paul Christiano on how OpenAI is developing real solutions to the AI alignment problem, and his vision of how humanity will progressively hand over decision-making to AI systems

Why this matters
Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.
Summary
This conversation examines technical alignment through Paul Christiano's account of how OpenAI is developing real solutions to the AI alignment problem and his vision of how humanity will progressively hand over decision-making to AI systems. It surfaces the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
Mixed · Technical · High confidence · Transcript-informed
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Showing 140 of 283 segments for display; stats use the full pass.
Across 283 full-transcript segments: median 0 · mean -4 · spread -37–17 (p10–p90 -13–0) · 3% risk-forward, 97% mixed, 0% opportunity-forward slices.
Slice bands
283 slices · p10–p90 -13–0
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: high.
- Emphasizes alignment
- Emphasizes control
- Full transcript scored in 283 sequential slices (median slice 0).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
ai-safety · alignment · 80000-hours · technical-alignment · technical
Episode transcript
YouTube captions (auto or uploaded) · video pkIJgZcf-Qo · stored Apr 8, 2026 · 8,212 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
Hi listeners, this is the 80,000 Hours Podcast, where each week we have an unusually in-depth conversation about the world's most pressing problems and how you can use your career to solve them. I'm Rob Wiblin, Director of Research at 80,000 Hours.

Today's episode is long for a reason. My producer Keiran Harris listened to our first recording session and said it was his favorite episode so far, so we decided to go back and add another 90 minutes to cover a bunch of issues that we didn't get to the first time around. As a result, the episode summary can really only touch on a fraction of the topics that come up. It really is pretty exciting. I hope you enjoy it as much as we did, and if you know someone working near AI or machine learning, please do pass this conversation on to them so they can enjoy it as well.

But first, I wanted to let you know that last week we released probably our most important article of the year. It's called "These are the world's highest impact career paths according to our research", and it distills many years of work into a single article that brings you up to date on what 80,000 Hours recommends today. It outlines our new suggested process, which any of you could potentially use to generate a short list of high-impact career options given your personal situation. It then describes the five key categories of career that we most often find ourselves recommending, which should produce at least one good option for almost all graduates. Finally, it lists and explains the top 10 priority paths that we want to draw attention to, because we think they can enable the right person to do an especially large amount of good for the world. I definitely recommend checking it out, and we'll link to it from the show notes and the blog post.

Today I'm speaking with Dr.
Paul Christiano. Paul recently completed a PhD in theoretical computer science at UC Berkeley and is now a researcher at OpenAI, working on aligning artificial intelligence with human values. He blogs at ai-alignment.com. Thanks for coming on the podcast, Paul.

Thanks for having me.

We plan to talk about Paul's views on how the transition to an AI-dominated economy will actually occur and how listeners can contribute to making that transition go better. But first, I'd like to give you a chance to frame the issue of AI alignment in your own words. What is the problem of AI safety, and why did you decide to work on it yourself?

Well, I see alignment as the problem of building AI systems that are trying to do the thing that we want them to do. In some sense that might sound like it should be very easy: we're building the AI system, we get to write the code, we get to choose how the AI system is trained. But there are some reasons that it seems kind of hard to train an AI system to do exactly the right thing. We have something we want in the world. For example, we want an AI to help us govern better, to help us enforce the law, or to help us run a company. We have something we want that AI to do, and there are technical reasons why it's not trivial to build an AI that's actually trying to do the thing we want it to do. That's the alignment problem.

I care about that problem a lot because I think we're moving towards a world where most of the decisions are made by intelligent machines. So if those machines aren't trying to do the things humans want them to do, then the world is going to go off in a bad direction. Suppose the AI systems we can build are really good, and it's easy to train them to maximize profit, or to get users to visit a website, or to get users to press the button saying that the AI did well. Then you have a world that's increasingly optimized for things like making
profit, getting users to click on buttons, or getting users to spend time on websites, without being increasingly optimized for having good policies, heading in a trajectory that we're happy with, or helping us figure out what we want and how to get it. That's the alignment problem. The safety problem is somewhat broader: understanding things that might go poorly with AI, and what technical and political work we can do to improve the probability that things go well.

Right. So what concretely do you do at OpenAI?

I do machine learning research, which is a combination of writing code, running experiments, and thinking about how machine learning systems should work. That means trying to understand what the important problems are and how we could fix them, and planning out which experiments give us interesting information. What capabilities do we need if we want to build aligned AI 5, 10, or 20 years down the road? What should we do today to work towards those capabilities? What are the hardest parts? So, trying to understand what we need to do, and then actually trying to do it.

Makes sense. The first big topic I wanted to get into was the strategic landscape of AI safety research, both technical and also political and strategic. Partly I wanted to do that first because I understand it better than the technical stuff, so I didn't want to be floundering right off the bat. What basically caused you to form your current views about AI alignment and to regard it as a really important problem? And how have your views on this changed over time?

There are a lot of parts to my views on this. It's a kind of complicated pipeline from "do the most good for the most people" to "so I should write this particular machine learning code". Very broadly speaking, I come in with a utilitarian perspective: you care about more people more. Then, if you take that
perspective and you think future populations will be very large, you start asking which features of the world today affect the long-run trajectory of civilization. If you come in with that question, there are a couple of very natural categories of things. First, if we all die, then we're all dead forever. Second, there's a distribution of values, or of optimization, in the world, and that can be sticky. If you create entities that are optimizing for something, those entities can entrench themselves and be hard to remove, in the same way that humans are kind of hard to remove at this point: you try to kill humans, and humans bounce back. There are a few ways you can change the distribution of values in the world. I think the most natural, or the most likely, one is that as we build AI systems, we pass the torch from humans, who want one set of things, to AI systems that potentially want a different set of things. So apart from going extinct, I think bungling that transition is the easiest way to head in a bad direction, or to permanently alter the trajectory of civilization.

At a very high level, that's how I got to thinking about AI many years ago. Once you have that perspective, you then have to look at the actual character of AI and ask how likely this kind of failure mode is. That is, what actually determines what the AI is trying to optimize? You have to start thinking in detail about the kinds of techniques people are using to produce AI. After doing that, I became pretty convinced that there are significant problems, or that there's some actual difficulty, in building AI that's trying to do the thing that the human who built it wants it to do. If we could resolve that technical problem, that would be great: then we'd dodge this difficulty of humans maybe passing off control to systems that don't want the same things we want. Then, zooming in a little bit more:
this is a problem that some people care about, but we also care about a lot of other things. We're also all competing with one another, which creates a lot of pressure for us to build whatever kind of AI works best. So there's a fundamental tension between building AI that works best at the tasks we want our AI to achieve and building AI that robustly shares our values, or is trying to do the same things that we want it to do. In the current situation, we don't know how to build AI that is maximally effective but still robustly beneficial. If we don't understand that, then people deploying AI will face some trade-off between those two goals. By default, I think competitive pressures would cause people to push far towards the AI that's really effective at doing what they want, like acquiring influence or navigating conflict, but not necessarily robustly beneficial. So then we would need to either coordinate somehow to overcome that pressure, so that we all agree to build AI that actually does what we want it to do rather than AI that is effective in conflict, say, or make technical progress so that there's no such trade-off.

To what extent do you view the arms race dynamic, the fact that people might try to develop AI prematurely because they're in a competitive situation, as the key problem that's driving the lack of safety?

I think the competitive pressure to develop AI is, in some sense, the only reason there's a problem. But describing it as an arms race feels potentially somewhat narrow. The problem is not restricted to conflict among states, say. It's not even restricted to conflict per se. If we had really secure property rights, so everyone owned some stuff and the stuff they owned was just theirs, then it would be very easy to ignore: individuals could just opt out of AI risk
being a thing. They could just say, "Great, I have some land and some resources in space; I'm just going to chill, take things really slow and careful, and understand." Given that's not the case, faster technological progress tends to give you a larger share of the stuff, quite apart from violent conflict. Most resources are just sitting around unclaimed, so if you go faster, you get more of them. If there are two countries and one of them is ten years ahead in technology, everyone expects that country to expand to space first and, over the very long run, claim more resources there. De facto, quite apart from violent conflict, it will also claim more resources on Earth, and so on. I think the problem comes from the fact that you can't take it slow, because other people aren't taking it slow. That is, we're all forced to develop AI, or technology generally, roughly as fast as we can. So I don't think of it as restricted to arms races or conflict among states.

I think there would probably still be some problem even if people weren't forced to go quickly. Everyone kind of wants to go quickly in the current world; most people care a lot about having nicer things next year. So even without a competitive dynamic, I think many people would deploy AI as soon as it was practical, to become much richer or to advance technology more rapidly. So we would still have some problem. Maybe it would be a third as large, or something like that.

How much attention are people paying to these kinds of problems now? My perception is that interest has ramped up a huge amount. But of course, the resources going into just increasing the capabilities of AI have also grown a lot, so it's unclear whether safety has become a larger fraction of the whole.

I think in terms of profile of the
issue, meaning how much discussion there is of the problem, safety has scaled up faster than AI broadly, so it's a larger fraction of the discussion now. More discussion of the issue doesn't necessarily translate into anything super productive, though it definitely translates into people in machine learning being a little bit annoyed about it. The number of people doing research has also scaled up significantly, but that's maybe more in line with the rate of progress in the field. I'm not sure the fraction of people working on alignment full-time has... actually, no, I think that's also scaled up, maybe by a factor of two in relative terms, or something. If you look at publications in top machine learning conferences, there's an increasing number, maybe a few at the last NIPS, that are very specifically directed at the problem. Their framing is roughly: we want our AI to be doing the thing we want it to be doing, we don't have a way to do that right now, so let's push technology in that direction, towards AI that can understand what we want and help us get it. So now there are a few papers in each conference that are very explicitly targeted at that goal, up from zero or one.

At the same time, some aspects of the alignment problem are more clear-cut, such as building AI that's able to reason about what humans want. Other aspects are less clear-cut and more arcane-seeming, for example issues distinctive to AI that exceeds human capabilities in some respect. I think the more arcane issues are also starting to go from basically nothing to being discussed a little.

What kind of arcane issues are you thinking of?

There's the problem of building weak AI systems that want to do what humans want them to do. There's then a bunch of additional difficulties that appear when
you imagine that the AI you're training is a lot smarter than you are in some respect, so then you need some other strategy. When you have a weak AI, it's very easy to say what the goal is, what you want the AI to do: do something that looks good to you. If you have a very strong AI, then you face a sort of philosophical difficulty about what the right behavior for such a system is. There can be no very straightforward technical answer where you prove a theorem and say "this is the right thing"; you can't really prove a theorem. You have to do some work to establish that you're happy with what this AI is doing, even though no human understands what it is doing.

In parallel with the specification stuff, another big part of alignment is understanding how to train models that continue to do what you want. You train your AI on the training distribution, and it does what you want. There's a further problem: when you deploy it, on the test distribution, it may do something catastrophic and different from what you want. On that problem too, I think interest has probably scaled up even more rapidly. Adversarial machine learning studies cases where an adversary finds some situation in which your AI does something very bad. The number of people working on that problem has scaled up and more than doubled as a fraction of the field, although it's still kind of small in absolute terms.

What do you think would cause people to seriously scale up their work on this topic? And do you think it's likely to come in time to solve the problem, if you're right that there are serious risks?

Yeah. Where we're currently at, it seems clear that there is a real problem: there's this technical difficulty of building AI that does what we want it to do. It's not yet clear
whether that problem is super hard; I think we're really uncertain about that. I'm working on it not because I'm confident it's super hard, but because it seems pretty plausible that it's hard. I think the machine learning community would be much, much more motivated to work on the problem if it became clear that it was going to be serious. But people aren't very good at coping with something like "there's a 30% chance this is going to be a huge problem". One big thing is that as it becomes clearer, I think many more people will work on the problem.

I talked about these issues of training weak AI systems to do what humans want them to do. I think it is becoming clearer that that's a big problem. For example, robotics is getting good enough that it will start to be limited by whether you can communicate to the robot what it actually ought to be doing. Or, people are becoming very familiar with the fact that YouTube has some algorithm that decides which video to show you. People have some intuitive understanding that the algorithm has a goal. If that goal is not the one we collectively, as YouTube's users, would want, it pushes the world in an annoying direction: towards people spending a bunch of time on YouTube rather than their lives being better. We seem to be at the stage where some aspects of these problems are becoming more obvious, and that makes it a lot easier for people to work on those aspects. As we get closer to AI, assuming these problems are serious, it will become more and more obvious that they are. We'll be building AI systems that humans don't understand, and the fact that their values are not quite right will be causing serious problems. That's one axis. The other axis is that I'm
particularly interested in the possibility of transformative AI, which has a very large effect on the world: it starts replacing humans in the great majority of economically useful work. Right now we're very uncertain about the timelines for that. I think there's a reasonable chance within 20 years, say, but there's certainly no compelling evidence that it will happen within 20 years. As that becomes clearer, I think many more people will start thinking about catastrophic risks in particular, because those will become more plausible.

Your concerns about how transformative AI could go badly have become pretty mainstream, but not everyone is convinced. How compelling do you think the arguments are that people should be worried about this? And is there anything you'd like to say to try to persuade skeptics who might be listening?

I think almost everyone in machine learning is convinced that there is an alignment problem: building AI that's trying to do what you want it to do requires some amount of work. There are a few points of disagreement within the machine learning community. One is whether that problem is hard enough to be worth focusing on and pushing on differentially, or whether it will simply get solved in the normal course of AI research. On that point, I think that to be really excited about working on the problem, you have to be asking what we can do to make AI go better. If you're just asking how we can get really powerful AI that does good things as soon as possible, then I think the argument for working on alignment is actually not that compelling. But if you're asking the
question of how we actually maximize the probability that this goes well, then it doesn't really matter whether that ought to be part of the job of AI researchers. We should all be really excited about putting more resources into it to make it go faster. If someone really takes seriously the goal of making AI go well, instead of just pushing on AI to make cool stuff happen sooner or to realize benefits over the next five years, then I think that case is pretty strong right now.

Another point of disagreement in the ML community is maybe more about framing than substance, which is the kind of thing I find pretty annoying. In one frame, you say, "It's very likely to kill everyone, there's going to be some robot uprising, it's going to be a huge mess, and this should be at the top of our list of problems." In another frame, people say, "Well, if we as the AI community failed to do our jobs, then yes, something bad would happen, but it's kind of offensive for you to say that we as the AI community are going to fail to do our jobs." It doesn't seem like you should have to convince anyone on the second issue, though. You should just agree: yes, it would be really bad if we failed to do our jobs. The discussion we're having right now is not an argument that everyone should be freaking out. It's us doing our jobs. In a discussion about how to do our jobs, you can't say it's going to be fine because we're going to do our jobs. That might be an appropriate response in some kinds of discussion, but when the conversation is about whether to spend some money on this now, I think it's not such a great
response.

I also think "safety" is a really unfortunate word. Lots of people don't like it, but it's hard to move away from. If you describe the problem as "let's train an AI to do what we want it to do", people ask why you call that safety, since it's just the problem of building good AI. I'm happy with that; I'm happy to say that this is just doing AI reasonably well. But then it's not really an argument for why one shouldn't push more money or more effort into that area. It's a part of AI that's particularly important to whether AI has a positive or negative effect.

In my experience, those are the two biggest disagreements, and the biggest substantive one is whether this is something that's going to get done easily anyway. I think people tend to have a normal level of overconfidence about how easy problems will end up being, together with not having a real... I think there aren't many people who are really prioritizing the question "how do you make AI go well?" Most are asking "how do I make the cool thing I want happen as soon as possible in calendar time?" I think that's unfortunate. It's a hard kind of thing to convince people on, in part because discussions about values are always a little bit hard.

So what do you think are the best arguments against being concerned about this issue, or at least against prioritizing directing resources towards it, and why don't they persuade you?

I think there are a few classes of arguments. Probably the ones I find most compelling are opportunity cost arguments, where someone proposes a concrete alternative: you're concerned about X, but have you considered that Y is even more concerning? Imagine someone saying: look, the risk of bioterrorism killing everyone is high enough that you
should [work on that]; on the margin, returns to that are higher than returns to AI safety. I'm mostly not compelled by those arguments. In part that's a comparative advantage thing: I don't really have to evaluate those arguments, because it's fairly clear what my comparative advantage is. And in part, for every argument of that form, I have some separate reason for not being compelled. So that's one class of arguments against.

In terms of the actual value of working on AI safety, I think the biggest concern is whether this is an easy problem that will get solved anyway. Maybe the second biggest concern is whether it's so difficult that one shouldn't bother working on it, or should assume we need some other approach. You could imagine the technical problem being hard enough that almost all the bang comes from policy solutions rather than technical ones. Those two concerns may sound contradictory, but they aren't necessarily, because we have some uncertainty about how hard the problem is. You could think that either it will be easy enough to get solved anyway, or hard enough that working on it now won't help much, so what mostly matters is getting our policy response in order. I don't find that compelling, for two reasons. First, I think there's significant probability on the range in between. Second, working on the problem earlier will tell us what's going on. If we're in a world where we need a really drastic policy response to cope with this problem, then we want to know that as soon as possible. It's not a good move to say we won't work on the problem because, if it's serious, we'll have a dramatic policy response. You want to work on it earlier, discover that it seems really hard, and then have
significantly more motivation for attempting the kind of coordination you'd need to get around it.

It seems to me like it's just too soon to say whether it's very easy, moderately difficult, or very difficult to solve.

Right, that's definitely my take. People make some arguments in both directions, and we could talk about particular ones, but overall I find them all pretty unconvincing. I think a lot of the sense that it's easy comes from intuitions like "Look, we get to build the AI, we get to choose the training process, we get to look at all the computation the AI is doing. Maybe it's hard to get it to try to do exactly what you want, but how hard can it be to get it not to try to kill everyone?" There's a pretty big gap between the behavior we want and the behavior of reasoning about which output will most lead to humans being crushed, so it feels like we should be able to distinguish those. There's something to that intuition; it is relevant to reasoning about how hard the problem is. But it doesn't carry much weight on its own. You really have to get into the actual details of how we produce our AI systems, how likely that is to work, and what the distribution of possible outcomes is, before you can say anything with confidence. I think once you do that, the picture doesn't look quite as rosy.

You mentioned that one of the most potentially compelling counterarguments was that there are other really important things people could be doing that might be even more pressing. What do you think are the most important things to work on, other than AI safety?

I guess I have two kinds of answers to this question. One is the standard list of things people would give, which I think are
the most likely things to be good alternatives. For example, among the utilitarian crowd, I think existential risk from engineered pandemics is a very salient option, or maybe the somewhat broader biotech category, and other things in that genre. You can also look at the world more broadly: intervening in political processes, improving political institutions, or just pushing governance in a direction that we think is conducive to a good world, or to a world on a good long-term trajectory. Those are examples of problems that lots of people would advocate for, and my best guess is that if lots of people think X is important, that's good evidence that X is important.

The second kind of answer is the problems I find most tempting to work on, which will systematically tend to be things other people don't care about. You can add a lot of value when a thing is important but neglected; what matters is the ratio between how important it actually is and how important other people think it is. At that level, I'm particularly excited about very weird utilitarian arguments, and about people doing more thinking about which features of the world actually determine whether we're on a positive or negative trajectory. A lot of considerations are extremely important from this long-run utilitarian perspective but not very important according to people's normal view of the world, or normal values. So one big area is just thinking about and acting on that space of considerations. Here's a kind of weird example that hopefully illustrates the point: normal people care a ton about whether humanity... we care a ton about
catastrophic risks; people would really care if everyone died. If you're a weird utilitarian, you're like, well, it'd be bad if everyone died, but even in that scenario there's a bunch of weird stuff you could do to try to improve the probability that things turn out okay in the end. That includes things like working on extremely robust bunkers that are capable of repopulating the world, or, in the extreme case where all humans die, you're like, well, we'd like some other animal to later come along and evolve intelligent life again and colonize the stars. Those are weird scenarios that basically no one tries to push on. No one is asking: what could we do as a civilization to make things better for whoever comes after us, if they manage to pull themselves up? And so, because no one's working on them, even though they're not that important in absolute terms, I think it's reasonably likely that they're good things to work on. Those are examples of kind of weird things. There's also a bunch of not-as-weird things that seem pretty exciting to me, especially things about improving how well people are able to think, or improving how well institutions function, which I'd be happy to go into in more detail on, though I'm not an expert in them.

Yeah, maybe you just want to list off a couple of them?

So, just listing the high-level areas that seem good to me: there's thinking about the utilitarian picture and what's important to a future-focused utilitarian. There's thinking about extinction risks, maybe the extinction risks that are especially interesting to people who care about extinction: things like bunkers, things like repopulation in the future, things like understanding the tails of normal risks, so understanding the tails of climate change and the tails of nuclear war. There are more normal interventions
like pushing on peace, but especially with an eye towards avoiding the most extreme forms of war, or mitigating the severity of all-out war. There's pushing on institutional quality: experimenting with institutions like prediction markets, or different ways of aggregating information or making decisions across people. There's just running tons of experiments and understanding what factors influence individual cognitive performance, or performance within organizations, or decision making. An example of a thing that I'm kind of shocked by is how little study there is of nootropics and cognitive enhancement broadly. I think that's the kind of thing that's relatively cheap and seems like such good bang for your buck in expectation that it's pretty damning for civilization that we haven't invested in it. Yeah, those are a few examples.

Okay, great. Coming back to AI, how important is it to make sure that the best AI safety team ends up existing within the organization that has the best general machine learning firepower behind it?

So you could imagine splitting up the functions of people working on AI safety into two categories. One category is developing technical understanding which is sufficient to build aligned AI: doing research, saying here are some algorithms, here's some analysis that seems important. The second function is actually affecting the way that an AI project is carried out, to make sure it reflects our understanding of how to build aligned AI. For the first function, it's not super important. If you want to be doing research on alignment, you want to have access to machine learning expertise, so you need to be somewhere that's doing reasonably good machine learning research, but it's not that important that you be at the place that's at the literal cutting edge. For the second function, it's quite
important. If you imagine someone actually building very, very powerful AI systems, I think the only way in practice that society's expertise about how to build aligned AI is going to affect the way we build AGI is by having a bunch of people who have made it their career to understand and work on those considerations be involved in the process of creating AGI. So for that second function, it's quite important: if you want the AI to be safe, you need the people involved in the development of that AI to basically be alignment researchers.

Do you think we're heading towards a world where we have the right distribution of people?

Yeah, so I think things are currently okay on that front. We're currently in a mode where we're somewhat confident there won't be powerful AI systems within two or three years, and so over the short term there's not as much pressure as there will be, closer to the day, to consolidate behind the projects that pose a catastrophic risk. I would be optimistic that if we were in a situation where we actually faced a significant prospect of existential risk from AI over the next two years, there would be significantly more pressure: both pressure for safety researchers to follow wherever that AI was being built, or to be allocated across the organizations working on AI that poses an existential risk, and also a lot of pressure within such organizations to be actively seeking safety researchers. My hope would be that, as a safety researcher, you don't literally have to pick a long time in advance which organizations you think will be doing that development. You can say: we're going to try to develop the understanding that is needed to make this AI safe, and we're going to work at an organization that's amongst those that
might be doing development of dangerous AI, and then we're going to try to live in the kind of world where, as we get very close, people understand the need for, and are motivated, to concentrate more expertise in alignment and safety, and then that happens at that time.

It seems like there are some risks to creating new organizations, because you get a splintering of the effort and also potential coordination problems between the different groups. How do you feel we should split additional resources between expanding existing research organizations and creating new projects?

So I agree that to the extent we have a coordination problem amongst developers of AI, that is, to the extent the field finds it harder to reach agreements or be regulated as there are more and more actors, then all else equal you'd prefer not to have a bunch of new actors. I think that's mostly the case for people doing AI development. For projects that are doing alignment per se, I don't think it's a huge deal, and whether to contribute to existing efforts or create new efforts should mostly be determined by other considerations. In the context of AI projects, all else equal, if you're interested in alignment, you should only be creating new AI projects where you have some very significant interest in doing so. It's not a huge deal, but it's nicer to have a smaller number of more pro-social actors than to have a larger number of actors with uncertain motivations, or even a similar distribution of motivations.

So how much of the variance in outcomes from artificial general intelligence, in your estimates, comes from uncertainty about how good we'll be at actually working on the technical AI alignment problem, versus uncertainty about how the firms working to develop AGI will behave, and potentially how the governments of the countries where they're operating are going to behave?
Yeah, I think the largest source of variance is neither of those, but instead just how hard the problem is: what the character of the problem is. After that, I think the biggest uncertainty, though not necessarily the best place to push, is about how people behave: how much investment they make, how well they are able to reach agreements, and how motivated they are in general to change what they're doing in order to make things go well. I think that's a larger source of variance than the technical research we do in advance. It's also an especially hard thing to push on in advance. Increasing how much technical research we do in advance is very easy: if you want to increase that amount by 10%, that's incredibly cheap, whereas making a similarly big change in how people behave would be a kind of epic project. But I think more of the variance comes from how people behave. I'm very, very uncertain about the institutional context in which AI will be developed. I'm uncertain about how much each particular actor really cares about these issues, or, when push comes to shove, how far out of their way they would go to avoid catastrophic risk. I'm very uncertain about how feasible it will be to make agreements to avoid a race to the bottom on safety.

Another question that came in from a listener was, I guess, a bit of a hypothetical, but it's interesting to probe your intuitions here. What do you think would happen if several different firms or countries simultaneously made a very powerful general AI, some of which were aligned but some of which weren't and potentially went rogue with their own agenda? Do you think that would be a very bad situation in expectation?

My normal model does not involve a single moment where you're building powerful AI. Instead of a transition from nothing to very powerful AI, you have a bunch of actors gradually ratcheting up the
capacity of the systems they're able to build. But even if that's false, I'd expect developers to generally be really well-financed groups that are quite large, and so if there are smaller groups, I generally expect them to team up and effectively pool resources in one way or another, whether by explicit resource sharing, by merging, or by normal trading with each other. But we can still imagine, in general, that distributed across the world there are a bunch of powerful AI systems, some of which are aligned and some of which aren't. My default guess about what happens in that world is that if 10% of the AIs are aligned, then we capture about 10% as much value as if 100% of them were aligned. It's roughly in that ballpark.

Does that come from a 10% chance that the one aligned AGI out of 10 would win out? Or overall, do you have more of a view where there's going to be power sharing, where each group gets a fraction of the influence, as in the world today?

Yeah, so I don't have a super strong view on this, and in part I don't have a strong view because I end up at the same place regardless of how much stochasticity there is. Whether you get 10% of the stuff all the time or all the stuff 10% of the time, I don't have an incredibly strong preference between those, for kind of complicated reasons. In general, if there are two equally powerful actors, they could fight it out and just see what happens, and then, from behind the veil of ignorance, each of them wins half the time and crushes the other. I think normally people would prefer to reach compromises short of that: imagine how that conflict would go and say, well, if you're someone who would be more likely to win, then you'll extract a bunch of concessions from the weaker party, but everyone is incentivized
to reach an agreement where they don't have an all-out war. In general, that's how things normally go amongst humans: we're able to avoid all-out war most of the time, though not all the time. I would in general guess that AI systems will be better at that. Certainly in the long run, I think it's pretty clear AI systems will be better at negotiating to reach positive-sum trades, and avoiding war is often an example of a positive-sum trade. It's conceivable that in the short term you have AI systems that are very good at some kinds of tasks but not very good at diplomacy, or at reaching agreements, but I don't have a super strong view about that. I think that's the kind of thing that would determine to what extent you should predict there to be war. If people have transferred a lot of decision-making authority to machines, then you care a lot about things like whether machines are really good at waging war but not really good at the process of diplomacy. If they have that kind of differential ability, then you get an outcome that's more random, where someone crushes everyone else. If they're better at striking agreements, then you're more likely to say: well, look, here's the allocation of resources, here's how we'll split influence according to the results of what would happen if we fought; let's all not fight.

One topic that you've written quite a lot about is credible commitments and the need for organizations to be honest. I guess part of that is because it seems like it's going to be very important in the future for organizations involved in the development of AGI to be able to coordinate around safety and alignment, and to avoid getting into races with one another, or into a general environment of mistrust where they have reasons to go faster in order to compete with other groups. Has
anyone ever attempted to have organizations that are as credible in their commitments as this, and do you have much hope that we'll be able to do that?

So certainly, in the context of arms control agreements and monitoring, there are some efforts made for one organization to be able to credibly commit, or credibly demonstrate, that it's abiding by some agreement. I wrote this blog post on honest organizations, and I think the kind of measure I'm discussing there is both somewhat more extreme than things governments would normally be open to, and also more tailored to this setting: you have an organization which is currently not under the spotlight and is trying to set itself up in such a way that it's prepared to be trustworthy in the future if it is under the spotlight. I'm not aware of any organizations having tried that kind of thing, a private organization saying: well, we expect that some day in the future we might want to coordinate in this way, or be regulated in this way, so we're going to try to constitute ourselves such that it's very easy for someone to verify that we're complying with an agreement or a law. I'm not aware of people really having tried that much. There are some things that are implicitly this way: companies can change who they hire, and they can try to be more trustworthy by having executives, people on the board, or monitors embedded within the organization whom they think stakeholders will trust. There's certainly a lot of precedent for that. Yeah, I think the reason you gave for why this seems important to me in this context is basically right. I'm concerned about a setting where there's some trade-off between the capability of the AI systems you build and safety, and in the context of such a trade-off you're reasonably likely to want some agreement
that says everyone is going to meet this bar on safety. Given that everyone is committed to meeting that bar, there's not really an incentive to cut corners on safety, or at least nobody is able to act on that incentive. That agreement might take place as an informal agreement amongst developers. It might take place as domestic regulation, where law enforcement would like to allow AI companies to continue operating but would like to verify that they're not going to take over the world. It might take the form of agreements among states, and an agreement among states would involve, say, the US or China having some unusually high degree of trust in, or insight into, what firms in the other country are doing. Thinking forward to that kind of agreement, it seems like you would need machinery that's not currently in place, or that would be very, very hard to put in place at the moment. So anything you could do to make it easier seems potentially valuable; we could make it quite a lot easier, and there's a lot of room there.

Is this also a reason for anyone who's involved in AI research to maintain an extremely high level of integrity, so that they will be trusted in future?

I think having a very high level of integrity sounds good in general. As a utilitarian ideal, I'd like the people engaged in important projects to be mostly in it for their stated goals and to want to make the world better. But there's a somewhat different thing, which is how trustworthy you are to external stakeholders who wouldn't otherwise have trusted your organization. I think that's different from the normal sense: ranking people by integrity would give quite a different ranking from ranking them by demonstrable integrity, demonstrable to people very far away who don't necessarily trust the rest of the organization
they're involved in.

I didn't quite get that. Can you explain?

So, if I'm interacting with a colleague, I have some sense of how much they conduct themselves with integrity, and I could rank people by that. I'd love it if the people who were actually involved in making AI were people I'd judge as super high integrity. There's then a different question. Suppose you have some firm, and there's someone in the Chinese defense establishment reasoning about the conduct of that firm. They probably don't care that much whether someone I would judge as high integrity is involved in the process, because they don't have the information I'm using to make that judgment. From their perspective, they care a lot about the firm being structured such that they feel like they understand what the firm is doing. In particular, they have a natural suspicion that a formal agreement is just cover for US firms to be cutting corners while delaying their competitors. So they really want a lot of insight into what is happening at the firm, enough to have some confidence that there's no unobserved collusion between the US defense establishment and this firm, which is nominally complying with some international agreement, to undermine that agreement. That's an example of states looking into firms, but there's also the case of firms looking into firms. Similarly, there's one notion of integrity that's relevant to researchers interacting with each other and judging how much integrity they have, and there's something quite different that would help me, looking at research at Baidu, actually believe that research at Baidu is being
conducted such that when they make public statements, those statements are an accurate reflection of what they're doing, that they aren't colluding, and that there isn't a bunch of work going on behind the scenes to undermine nominal agreements.

Yeah, I think it is very valuable for people in this industry to be trustworthy for all of these reasons, but I guess I am a bit skeptical that trust alone is going to be enough, in part for the reasons you just gave. There's the famous Russian proverb, "trust, but verify", and it seems like there's been a lot of talk, at least publicly, about the importance of trust and maybe not enough about how we can come up with better ways of verifying what people's behavior actually is. One option, I guess, would be to have people from different organizations working together in the same building, or to move them together, so they can see what other groups are doing. That allows them to have a lot more trust just because they have much more visibility. How do you feel about that?

Yeah, so I would be pretty pessimistic about reaching any kind of substantive agreement in this space based only on trust in the other actors. It's conceivable amongst Western firms that are already quite close, where there's been a bunch of staff turnover from one to the other and everyone knows everyone; it's maybe conceivable in that case. But in general, when I talk about agreements, I'm imagining trust as a complement to fairly involved monitoring and enforcement mechanisms. The monitoring and enforcement problem in this context is quite difficult. Suppose firm A and firm B reach some nominal agreement that they're only going to develop AI that's safe according to some standard. It's very, very hard for firm A to demonstrate that to firm B without literally giving firm B enough
information that they could basically just take everything, or benefit from all of the research that firm A is doing. There's no easy solution to this problem. The problem is easier to the extent that you believe, say, the firm is not running a completely fraudulent operation to keep up appearances, but in addition to having enough insight to verify that, you still need to do a whole bunch of work to actually control how development is going. Say I'm running a bunch of code on some giant computing cluster. You can look and see that indeed I'm running some code on this cluster, and even if I literally showed you all of the code I was running on the cluster, that wouldn't be that helpful. It's very hard for you to trust what I'm doing unless you have literally watched the entire process by which the code was produced, or you're confident there wasn't some other process hidden away that's writing the real code. The thing you can see might just be a cover: it looks like we're running some scheduling job, but actually it's carrying some real payload, a bunch of actual AI research whose results are getting smuggled out to the real AI research group.

Could you have an agreement in which every organization accepts that all of the other groups are going to try to put clandestine informants inside their organization, and that that's just an acceptable thing for everyone to do to one another, because it's the only way you could really believe what someone's telling you?

I think there's a split between two ways of doing this kind of coordination. On one arm, you try to maintain something like the status quo, where a bunch of people are independently pushing on AI progress. In order to maintain that arm, there's some limit on how much transparency different developers can have into each
other's research. That's one arm, and there's a second arm where you just give up on that and say: yes, all the information is going to leak. The difficulty in the first arm is that you have to walk this really fine line, where you're trying to give people enough insight, which probably involves monitors, whistleblowing, and other mechanisms whereby people whom firm A trusts are embedded in firm B. That's what makes it hard: doing monitoring without leaking all the information. You have to walk that fine line. And if you're willing to leak all the information, then the main difficulty seems to be that you have to reach some new agreement about how you're actually going to divide the fruits of AI research. Right now there's an implicit status quo where people who make more AI progress expect to capture some benefits by virtue of having made more AI progress. So you could say: no, we're going to deviate from the status quo and agree that we're going to develop AI effectively jointly, either because it's literally joint, or because we've all opened ourselves up, or the leader has opened itself up, to enough monitoring that it ceases to be the leader. If you do that, then you have to reach some agreement where you say: here's how we compensate the leader for the fact that they were the leader. Or else the leader has to be willing to say: yeah, I used to have a high valuation because I was doing so well in AI, and now I'm happy to accept that that advantage is going to get eroded, because that reduces the risk of the world being destroyed. Both of those seem like reasonable options to me, and which one you take depends a little bit on how serious the problem appears to be and what the actual structure of the field is. Coordinating is more reasonable if the relevant actors are kind of close, such that
well, it's more reasonable if either there's an obvious leader who is going to capture the benefits and is willing to distribute them reasonably, or if somehow there's not a big difference between the players. So it's a question about the dynamics of racing in AI. Imagine the US and China: things are hard if each of them believes that it's ahead in AI and that it's going to benefit by having AI research which isn't available to its competitor. Things are easy if both believe they're behind. And if both have an accurate appraisal of the situation and understand there's not a big difference, then maybe we're also okay, because everyone's fine saying: sure, I'm fine leaking, because I know we're roughly even, so I'm not going to lose a whole lot by leaking information to you.

Okay, let's turn now to this question of fast versus slow takeoff of artificial intelligence. Historically, a lot of people who have been worried about AI alignment have tended to expect progress to be relatively gradual for a while and then to suddenly accelerate and take off very quickly, over a period of days or weeks or months rather than years. But for some time you've been promoting the view that the takeoff of general AI is going to be more gradual than that. Do you want to explain your general view?

Yeah, so it's worth clarifying that when I say slow, I still mean very fast compared to most people's expectations. I think a transition taking place over a few years, maybe two years, between AI having very significant economic impact and AI literally doing everything sounds pretty plausible. To most people on the street, such a two-year transition sounds like a pretty fast takeoff. I think that's important to clarify: when I say
slow, I don't mean what most people think of as slow. Another thing that's important to clarify is that I think there's rough agreement amongst the alignment and safety crowd about what would happen if we did have human-level AI: everyone agrees that at that point progress has probably exploded and is occurring very quickly. The main disagreement is about what happens in advance of that. My view is that in advance of that, the world has already changed very substantially, and you're likely already exposed to catastrophic AI risk. In particular, when someone develops human-level AI, it's not going to emerge in a world like the world of today, where having human-level AI would indeed give you a decisive advantage. Instead it will emerge in a world which is already much, much crazier than the world of today, where having human-level AI gives you some more modest advantage.

Do you want to paint a picture for us of what that world might look like?

I guess there are a bunch of different parts of the world and I could focus on different ones, but I can try to give some random views, or facts from that world.

They're not real facts; they're Paul's wild speculations.

So in terms of calibrating what AI progress looks like, how rapid it is, two things seem reasonable to think about. One is the current rate of progress in information technology in general, which would suggest something like AI falling in cost by a factor of two every year or so, or every six to twelve months. Another thing that I think is important for getting an intuitive sense of the scale is to compare to intelligence in nature. When people do intuitive extrapolation of AI, they often think about abilities within the human range. One thing that I do agree with proponents of fast takeoff about is that that's not a very
accurate perspective when thinking about AI. I think a better comparison is to look at what evolution was able to do with varying amounts of compute. Look at what each order of magnitude buys you in nature, going from insects to small fish to lizards to rats to crows to primates to humans. Each of those is roughly one order of magnitude, and you should be thinking in terms of these jumps, even though the difference between insect and lizard feels a lot smaller, and has less intuitive significance, than the difference between crow and primate or between primate and human. So when I'm thinking about AI capabilities, I'm imagining intuitively (and this is not that accurate, but I think it's useful as an example to ground things out) this line rising. One year you have an AI which is capable of very simple learning tasks and motor control. A year later you have an AI that's capable of slightly more sophisticated learning, that now learns as well as a crow or something, and is starting to get deployed as quickly as possible in the world and to have a transformative impact. Then a year after that, AI has taken over the process of doing science from humans. I think that's important to have in mind as background when talking about what this world looks like.

What tasks that are economically valuable can you put an AI that's as smart as a crow on?

So I think there are a few kinds of answers. One place where I think you definitely have a big impact is in robotics, in domains like manufacturing, logistics, and construction. I think lower animals are probably good enough at motor control that you'd have much, much better robotics than you have now. Today I would say robots that learn don't really work very well, or at all. The
way you get robotics to work is that you really organize your manufacturing process around the robots; they're quite expensive and tricky, and it's just hard to roll them out. I think in this world, probably even before you have crow-level AI, you have robots that are very general and flexible. Not only can they take the place of humans on assembly lines quite reliably, they can also be applied in logistics (loading and unloading trucks, driving trucks, managing warehouses) and in construction.

Image identification as well?

They could certainly do image identification well; I think that sort of thing we get a little bit earlier. Today those activities are a large part of the economy. I don't actually know what share the things we just listed make up. In the US it's probably lower than elsewhere, but it's still more than 10% of our economy and less than 25%. There's another class of activities: if you look at the intellectual work humans do, I think a significant part of it could be done by very cheap AIs at the level of crows, or not much more sophisticated than crows. There's also a significant part that requires a lot more sophistication. I think we're very uncertain about how hard doing science is. As an example, back in the day we would have said that playing board games designed to tax human intelligence, like chess or Go, is really quite hard, and it feels to humans like they're leveraging all their intelligence when doing it. It turns out that playing chess, from the perspective of actually designing a computation to play chess, is incredibly easy: it takes a brain much smaller than an insect brain to play chess much better than humans do. I think it's pretty clear at this point that science makes better use of human brains than chess does, but it's
actually not clear how much better. So it's totally conceivable from our current perspective, I think, that an intelligence that was as smart as a crow, but was actually designed for doing science, for doing engineering, for advancing technology as rapidly as possible, would actually out-compete humans pretty badly at those tasks. So I guess that's another important thing to have in mind when we talk about when stuff goes crazy. I would guess human level is an upper bound for when stuff goes crazy; that is, we kind of know that if you had cheap simulated humans, technological progress would be much, much faster than it is today. But probably stuff goes crazy somewhat before you actually get to humans. It's not clear how many orders of magnitude smaller a brain can be before things go crazy. I think at least one seems kind of safe, and then two or three is definitely plausible.
So it's a bit surprising to say that science isn't so hard, and that there might be a brain that in a sense is much smaller than a human's that could blow us out of the water in doing science. Can you try to make that more intuitive?
Yeah, so I mentioned this analogy to chess, which is that when humans play chess, we apply a lot of faculties that we evolved for other purposes, and we play chess much, much better than someone using pencil and paper to mechanically play chess at the speed that a human could. We're able to get a lot of mileage out of all of these other capacities: we evolved to be really good at physical manipulation, at planning in physical contexts and at reasoning about social situations, and that lets us play chess much better than if we didn't have all these capacities. That said, if you just write down a simple algorithm for playing chess and you run it with a tiny, tiny fraction of the compute that
a human uses in order to play chess, it crushes humans incredibly consistently. And so in a similar sense, imagine this project of looking at some technological problem, considering a bunch of possible solutions, understanding what the real obstructions are and how we can try to overcome those obstructions. For a lot of the stuff we do there, we know that humans are much, much better than a simple mechanical algorithm applied to those tasks; that is, we're able to leverage all these abilities that helped us in the evolutionary environment to do really incredible things in terms of technological progress, doing science, designing systems, et cetera. But what's not clear is this: again, if you take the computation in the human brain and you actually put it in a shape that's optimal for playing chess, it plays chess many, many orders of magnitude better than a human. And similarly, if you took the computation in the human brain and you actually reorganized it, so that instead of a human explicitly considering some possibilities for how to approach this problem, a computer generates a billion possible solutions to this problem per second, in many respects we know that that computation would be much, much better than humans at resolving some parts of science and engineering. There's a question of exactly how much leverage we are getting out of all these evolutionary heuristics, and so it's not surprising that in the case of chess we're getting much less mileage than we do for tasks that leverage more of the full range of what the human brain does, or that are closer to the tasks the human brain was designed for. I think science and technology are in a kind of intermediate place, where they're still really, really not close to what human
brains are designed to do. So it's not that surprising if you can make brains that are really a lot better at science and technology than humans are. I think, a priori, it's not that much more surprising for science and technology than it would be for chess.
Okay, I took us somewhat away from the core of this fast versus slow takeoff discussion. One part of your argument that I think isn't immediately obvious is that when you're saying, in a sense, that takeoff will be slow, you're actually saying that dumber AI will have a lot more impact on the economy and on the world than other people think. Why do you disagree with other people about that? Why do you think that earlier versions of machine learning could already be having a transformative impact?
I think there are a bunch of dimensions to this disagreement. One interesting fact, I think, about the effective altruism and AI safety communities is that there's a surprising amount of agreement about takeoff being fast, but quite a large diversity of views about why takeoff will be fast. Certainly the arguments people would emphasize, if you were to talk with them, would be very, very different, and so my answer to this question is different for different people. I think there's one general issue: other people look at the evolutionary record and they see more of this transition between lower primates and humans, where humans seem incredibly good at doing a kind of reasoning that builds on itself, discovers new things and accumulates them over time culturally. They see that more as a jump that occurred around human intelligence and that is likely to be recapitulated in AI. I'm more inclined to see that jump as occurring when it did because of the structure of evolution. Evolution was not really trying to optimize... it was not trying to
optimize humans for cultural accumulation in any particularly meaningful sense. It was trying to optimize humans for the suite of tasks that primates are engaged in, and kind of incidentally humans became very good at cultural accumulation and reasoning. I think if you optimize AI systems for reasoning, it appears much, much earlier. If evolution had been trying to optimize for creatures that would build a civilization, then instead of going straight to humans, who have some level of ability at forming a technological civilization, it would have been able to produce crappier technological civilizations earlier. And I think it's probably not the case that if you left monkeys for long enough you would get a spacefaring civilization, but I think that's not a consequence of monkeys just being too dumb to do it. I think it's largely a consequence of the way monkeys' social dynamics work: the way imitation works amongst monkeys, the way cultural accumulation works, and how often things are forgotten. And so this discontinuity that we observe in the historical record between lower primates and humans certainly provides some indication about what kind of change you should expect to see in the context of AI, but I don't feel like it's giving us a really robust indicator, or that it's a very close analogy. So that's one important difference: there's this jump in the evolutionary record, and I expect that to the extent there's a similar jump in AI, we would see it significantly earlier, and we would jump to something significantly dumber than humans. That's a significant difference, I think, between my view and the view of, I don't know, maybe one third of the people who think takeoff is likely to be fast. There are, of course, other
differences. In general, I look at the historical record and it feels to me like there's an extremely strong regularity of the form: before you're able to make a really great version of something, you're able to make a much, much worse version of it. So for example, before you're able to make a really fast computer, you're able to make a really bad computer. Before you're able to make a really big explosive, you're able to make a really crappy explosive that's unreliable and expensive. Before you're able to make a robot that's able to do some very interesting tasks, you're able to make a robot which is able to do the task with lower reliability, or at greater expense, or in a narrow range of cases. That seems to me like a pretty robust regularity, and it seems most robust in cases where the metric we're checking is something people are really trying to optimize. If you're looking at a metric that people aren't trying to optimize, like how many books there are in the world, that's a property that changes discontinuously over the historical record. I think the reason for that is just that no one is trying to increase the number of books in the world; it's kind of incidental. There's a point in history when books are a relatively inefficient way of doing something, then you switch to books being an efficient way to do it, and the number of books increases dramatically. If you look at a measure people are actually trying to optimize, like how quickly information is transmitted, or how many facts... well, not the average person, but how many facts someone trying to learn facts knows, those metrics aren't going to change discontinuously in the same way that the number of books will. I think how smart your AI is, is not the kind of thing that's going to change like that. That's the kind of thing people are really, really pushing on and caring a lot about: how economically valuable your AI is.
And so I think that this historical regularity probably applies to the case of AI. There are a few plausible historical exceptions. I think the strongest one by far is the nuclear weapons case, but in that case, first, there are a lot of very good a priori arguments for a discontinuity that are much, much stronger than the arguments we give for AI, and even so, I think the extent of the discontinuity is normally overstated by people talking about the historical record. That's a second kind of disagreement. I think a third kind of disagreement is that people make a lot of sloppy arguments, or arguments that don't quite work. Here I feel a little bit less uncertain, because it feels like it's just a matter of: if you work through the arguments, they don't really hold together. An example of that: I think people often make this argument of imagining your AI as being like a human who sometimes makes mistakes. There's some epsilon fraction of the time, or fraction of cases, where the AI can't do what a human could do, and you're decreasing epsilon over time until you hit some critical threshold where your AI becomes super useful, once it's reliable enough, like when it gets to zero mistakes or, you know, one in a million mistakes. I think that model kind of looks a priori like a reasonable-ish model, but if you actually think about it, your AI is not like a human that's degraded in some way. If you take a human and you degrade them, there is a discontinuity as you get to really low levels of degradation, but in fact your AI is following a very different trajectory. The conclusions from that model turn out to be specific to the way you were thinking of AI as a degraded human. So those are three classes of disagreement.
So let's take it as given that you're right, that an AI takeoff will be more gradual than some
people think, although I guess still very fast by human timescales. What kind of strategic implications does that have for you and me today, trying to make that transition go better?
I think the biggest strategic question that I think about regularly that's influenced by this is to what extent early developers of AI will have a lot of leeway to do what they want with the AI that they built: how much advantage will they have over the rest of the world? I think some people have a model in which early developers of AI will be at a huge advantage: they can take their time, or they can be very picky about how they want to deploy their AI, and nevertheless radically reshape the world. I think that's conceivable, but it's much more likely that the early developers of AI will be developing in a world that already contains quite a lot of AI that's almost as good, and they really won't have that much breathing room. They won't be able to reap tremendous profits; they won't have much room to deliberate about how they use their AI. You won't be able to take your human-level AI and send it out on the internet to take over every computer, because this will occur in a world where all the computers that were easy to take over have already been taken over by much dumber AIs. It's more like you're existing in a soup of a bunch of very powerful systems. People imagine something like the world of today, with human-level AI venturing out into that world, and in that scenario you're able to do an incredible amount of stuff: you're able to basically steal everyone's stuff if you want to steal everyone's stuff, and you're able to win a war if you want to win a war. I think that model is less likely under slow takeoff, though it still depends quantitatively on exactly how slow, and it especially depends on, you know, maybe there's some
way for a military to develop AI selectively, in a way that would increase the probability of this outcome, if they're really aiming for this outcome of having a decisive strategic advantage. If this doesn't happen, if the person who develops AI doesn't have this kind of leeway, then I think the nature of the safety problem changes a little bit. In one respect it's harder, because now you really want to be building an AI that can do well everywhere: if you're not going to get to be picky about what tasks you're applying your AI to, you need an AI that can be applied to any task, an AI that can compete with a world full of a bunch of other AIs. You can't just say, "I'm going to focus on the tasks where I have a clear definition of what I'm trying to do," or, "I'm just going to pick a particular task which is sufficient to obtain a decisive advantage and focus on that one." You really have to say: based on the way the world is set up, there's a bunch of tasks that people want to apply AI to, and you need to be able to make those AIs safe. So in that respect it makes the problem substantially harder. It makes the problem easier in the sense that you do get a little bit of a learning period. As AI ramps up, people get to see a bunch of stuff going wrong; we get to roll out a bunch of systems and see how they work. So it's not like there's this one shot, this one moment where you press the button and then your AI goes and either destroys the world or it doesn't. It's more like there's a whole bunch of buttons: every day you push a new button, and if you mess up, then you're very unhappy that day, but it's not literally the end of the world until you push the button the 60th time. It also changes the nature of the policy or coordination problem a little bit. I think it tends to make the coordination problem harder, and it changes your sense of exactly what that problem will look like.
In particular, it's unlikely to just be between two AI developers who are racing to build a powerful AI that takes over the world. It's more likely that there are many people developing AI, or, you know, not many, but let's say there are a few companies developing AI, which is then being used by a very, very large number of people, in law enforcement, in the military and in private industry, and the kind of agreement you want is an agreement between those players. So again, the problem is easier in some sense, in that the military significance is not as clear. It's conceivable that the industry isn't nationalized, that this development isn't being done by the military, and that it's instead treated in a similar way to other strategically important industries. And then it's harder because you don't just hold your breath until an AI takes over the world and everything changes. You need to actually set up some sustainable regime where people are happy with the way AI development is going, and where people continue to engage in normal economic ways as they're developing AI. So in that sense the problem gets harder. So for both the technical and the policy problems, some aspects become harder and some aspects become easier.
Yeah, that's a very good answer. Other people would disagree with you, though. What do you think are the chances that you're wrong about this, and what's the counter-argument that gives you the greatest concern?
Yeah, I feel pretty uncertain about this question. Maybe we could try to quantify an answer to how fast takeoff is by talking about how much time elapses between certain benchmarks being met, or, if you have a one-year lead in the development of AI, how much advantage that gives you at various points in development. I think when I break out very
concrete consequences in the world, like if I ask what's the probability that the person who develops AI will be able to achieve a decisive advantage, for some operationalization, at some point, then I find myself disagreeing with other people's probabilities, but I can't disagree that strongly. So maybe other people will assign a 2/3 probability to that event and I'll assign a 1/4 probability to it, which is a pretty big disagreement, but it certainly doesn't look like either side being confident. Even, let's say, 2/3 versus 1/3 doesn't look like either side being super confident in their answer, and everyone needs to be willing to pursue policies that are robust across that uncertainty. I think the thing that makes me most sympathetic to the fast takeoff view is not any argument about a qualitative change around human level. It's more an argument of just looking quantitatively at the speed of development. If you were scaling up on this timescale, if every three months your AIs were equivalent to an animal with a brain twice as large, it would not be many months between an AI that seemed kind of minimally useful and an AI that was conferring a decisive strategic advantage. So there's just this quantitative question of exactly how fast development is, and even if there's no qualitative change, you can have development that's fast enough that it's correctly described as a fast takeoff. In that case, the view of the world I've described is not as accurate; we're more in the scenario where the AI developer can just keep things under wraps during these extra nine months and then has a lot of leeway about what to do.
How strong do you think the argument is that people involved in AI alignment work should focus on the fast takeoff scenario, even if it's less likely, because they expect to get more leverage personally if that
scenario does come to pass?
I think there's definitely a consideration in that direction. I think it tends to be significantly weaker than the similar argument for focusing on short timelines, which I think is quite a bit stronger. The way that argument runs, the reason you might focus on short timelines or on fast takeoff, is that over the course of a slow takeoff there'll be lots of opportunities to do additional work, experimentation and understanding of what's going on. If you have a view where that work can just replace anything you do now, then anything you do now becomes relatively unimportant. But imagine you have, let's say, a one- to two-year period where people are really scrambling, where it becomes clear to many people that there's a serious problem here and we'd like to fix it. If there's any kind of complementarity between the work we do now and the work that they're doing during that period, then that doesn't really undercut doing work now. I think it's good that we can do things in advance, like understand the nature of the alignment problem, understand much more about how difficult the problem is, and set up institutions so that they're prepared to make these investments. I think those things are maybe a little bit better in the fast takeoff world, but it's not a huge difference. Intuitively I think it's not more than a factor of two, but I haven't thought that much about it. Whereas the short timelines thing, I think, is a much larger update.
Yeah, tell us about that.
So if you think that AI is surprisingly soon: in general, what "surprisingly soon" means is that many people are surprised, and so they haven't made much investment. In those worlds
much less has been done. Certainly, if AI were developed in 50 years, I do not think the research I'm doing now could very plausibly be relevant, just because there's so much time: other people are going to rediscover the same things. If you get a year ahead now, that means maybe five years from now you're 11 months ahead of where you would have been otherwise, and five years after that you're eight months ahead of where you would have been otherwise, and over time the advantage just shrinks more and more. Whereas if AI is developed in ten years, then something crazy happened: the world at large has really been asleep at the wheel if we're going to have human-level AI in ten years, and in that world it's very easy to have a very large impact. Of course, if AI is developed in 50 years, it could still happen that people are asleep at the wheel in 40 years. You can make those investments independently: you can invest now for the case where people are asleep at the wheel, without foreclosing the possibility that people are asleep at the wheel in the future. And if they're not asleep at the wheel in the future, then what we do now has a much lower impact. So it's mostly just a neglectedness argument, where you don't really expect AI to be incredibly neglected. If in fact the people with short timelines are right, if 15% in ten years and 35% in twenty years is right, then AI is absurdly neglected at the moment. In that world, what we're currently seeing is not unjustified hype; it's desperately trying to catch up to what would be an acceptable level of investment given the actual probabilities we face.
Now, earlier you mentioned that if you have this two-year period where economic growth has really accelerated in a very visible way, people would
already be freaking out. Do you have a vision for exactly what that freaking out would look like, and what implications that has?
So I think there are different consequences in different domains. Amongst AI researchers, I think there are big consequences. A bunch of discussions that are currently kind of hypothetical and strange, like the way we talk about catastrophic risk, or about the possibility of AI surpassing humans, or about decisions being made by machines, will stop being weird considerations or speculative arguments and will start being, "This is basically already happening, and we're really freaked out about where this is going," or, "We feel very viscerally concerned." And so I think that's something that will have a significant effect on both what kind of research people are doing and how open they are to various kinds of coordination. That's a very optimistic view; I think it's totally plausible, but many people are much more pessimistic on that front than I am. I feel like if we're in this regime, people will really be thinking about powerful AI as the thing that's clearly coming, and they'll see catastrophic risk from AI as even more clear than powerful AI, just because we'll be living in a world where stuff is changing too fast for humans to understand in quite a clear way. In some respects our current world has that character, and that makes it easier to make this case than it would have been 15 years ago, but it'll be much, much more the case in the future.
Can you imagine countries hoarding computational ability because they don't want to help anyone else get in on the game?
So mostly I imagine that, by default, asset prices just get bid up a ton. It's not so much that the world hoards computation as it's just
that computers become incredibly expensive, and that flows backwards: semiconductor fabrication becomes incredibly expensive, and IP at chip companies becomes relatively valuable, though that could easily get competed away. So I think, to first order, the economic story is probably what I expect. But if you look at the world and imagine asset prices in some area rising by a factor of 10 over the course of a few years, or a year, I think the rough economic story is probably still basically right, but the formal structure of markets is pretty easy to break down in that case. You can easily end up in a world where computation is very expensive but prices are too sticky to adjust in the correct way. So instead it ends up looking like computers are still somewhat cheap, but effectively they're impossible for anyone to buy, or machine learning hardware is effectively impossible for people to buy at the nominal price. That world might look more like people hoarding computation, which I would say is mostly a symptom of an inefficient market. It's just that the price of your computer has gone up by an absurd amount, because everyone thinks this is incredibly important now and it's hard to produce computers as fast as people want them. In an inefficient market world, that may end up looking like freaking out, and it takes the form partly of a policy response instead of a market response, and of strategic behavior by militaries and large firms.
Okay, that has been the discussion of how fast or gradual this transition will be. Let's talk now about when you think this kind of thing might happen. What's your best guess for AI progress timelines?
So I normally think about this question in terms of the probability of some
particular development by 10 or 20 years, rather than thinking about what the median is, because those seem like the most decision-relevant numbers. Maybe one could also, if you had very short timelines, give probabilities for less than 10 years. So my probability for human labor being obsolete within 10 years is probably something in the ballpark of 15%, and within 20 years something in the ballpark of 35%. And prior to human labor being obsolete, you have some window of maybe a few years during which stuff is already getting quite extremely crazy, and probably existential risk becomes a big deal: we could have permanently sunk the ship somewhat before, one or two years before we actually have human labor being obsolete. Those are my current best guesses. I have the numbers offhand because I've been asked before, but I still feel very uncertain about them, and I think it's quite likely they'll change over the coming year, not just because new evidence comes in but also because I continue to reflect on my views. There are a lot of people whose views I think are quite reasonable who push for numbers both higher and lower, people making reasonable arguments for both much shorter timelines and longer timelines than that. Overall I come away pretty confused about why people are currently as confident as they are in their views. Compared to the world at large, the view I've described is incredibly aggressive, incredibly soon. Compared to the community of people who think about this a lot, I'm somewhere in the middle of the distribution, and I don't quite understand why people come away with much higher or much lower numbers than that. It seems to me
that the arguments people are making on both sides are really quite shaky. I can totally imagine that after being more thoughtful I would come away with higher or lower numbers, but I don't feel convinced that the people who are much more confident one way or the other have actually done the kind of analysis that means I should defer to them. I also don't think I've done the kind of analysis that other people should really be deferring to me on.
There's been discussion of fire alarms, which are indicators that you get ahead of time that you're about to develop a really transformative AI. Do you think that there will be fire alarms that will give us several years, or five or ten years, of notice that this is going to happen, and what might those alarms look like?
So I think the answer to this question depends a lot on which of the many different ways AI could look we end up in; different paths would have different signs in advance. If AI is developed very soon, say within the next 20 years, I think the best single guess for the way it looks is that the techniques we are using are more similar to evolution than they are to the learning that occurs within a human brain, and a way to get indications about where things are going is by comparing how well those techniques are working to how well evolution was able to do with different levels of computational resources. In that scenario, which I think is the most likely scenario within 20 years, I think the most likely fire alarm is something like successfully replicating the intelligence of lower animals. Right now we're kind of at the stage where our systems' sophistication is probably somewhere in the range of insect abilities; that's my current best guess, and I'm very uncertain about it. I think as you move from insects to small vertebrates to larger vertebrates up to
mice and then birds and so on, it becomes much, much more obvious: it's easier to make the comparison, and the behaviors become more qualitatively distinct. Also, every order of magnitude gets you an order of magnitude closer to humans. So before having broadly human-level AI, I think a reasonably good warning sign would be broadly lizard-level or broadly mouse-level AI, that is, learning algorithms which are able to do about as well as a mouse in a distribution of environments about as broad as the distribution of environments a mouse has to handle. I think that's a bit of a problematic alarm, for two reasons. One, it's actually quite difficult to get a distribution of environments as broad as the distribution that a mouse faces, so there's likely to be remaining concern: if you can replicate everything a mouse can do in a lab, that's maybe not so impressive. It's very difficult to actually test whether some distribution of environments is really flexing the most impressive mouse skills. I think that won't be a huge problem for each individual person; a very reasonable person looking at the evidence will still be able to get a good indication. But it'll be a huge problem for establishing consensus about what's going on. That's one problem. The other problem is the issue I mentioned, where it seems like transformative impact should come significantly before broadly human-level AI. So I think broadly mouse-level AI would probably not give you that much warning, and you need to be able to look a little bit earlier than mice. It's plausible that one should really be diving into the comparison to insects now and saying, "Look, can we really do this?" If we're in the world where our procedures are similar to evolution, it's
plausible to me that the insect comparison should be a good indication, or one of the better indications we'll be able to get in advance.
There was a recent blog post doing the rounds on social media called "AI winter is coming", which was broadly making the argument that people are realizing current machine learning techniques can't do the things people have been hoping they'll be able to do over the last couple of years, that the range of situations they can handle is much more limited, and the author expects that the economic opportunities for them are going to dry up somewhat and investment will shrink, as, so they claim, we've seen in the past when there's been a lot of enthusiasm about AI and it hasn't actually been able to do the things that were claimed. Do you think there's much chance that that's correct, and what's your general take on this AI boom, AI winter view?
I feel like the position that post takes is fairly extreme, in a way that's not very plausible. For example, I think the author of that post is pessimistic about self-driving cars actually working, because they won't be sufficiently reliable. It's correct to say this is a hard problem, but I would be extremely happy to take a bet at pretty good odds against the world they're imagining. I also feel somewhat similarly about robotics at this point. I think what we're currently able to do in the lab is approaching good enough for industrial robotics, and if the technology is able to work well there, it's a lot of value. I think what we can do in the lab is a very strong indication that it is going to work in the reasonably short term. So I think those things are pretty good indications that current investment in the field is probably justified, or, you know, that the
level of investment is plausible given the level of applications we can foresee quite easily, though I don't want to comment on the form of investment. I don't consider the arguments in the post very compelling; I think they're kind of wacky. One thing that makes it a little bit tricky is this comparison: if you compare the kind of AI we're building now to human intelligence, then literally until the very end, actually probably until after the very end, you're just going to say, look, there are always things humans can do that the algorithms can't. One problem is that that's just a terrible way to do the comparison; it's the kind of comparison that is predictably going to leave you really skeptical until the very, very end. There's another question, and maybe this is actually what they were getting at: there's a sense, certainly amongst deep learning true believers at the moment, that you can just take existing techniques and scale them quite far. If you just keep going, things are going to keep getting better and better, and we're going to get all the way to powerful AI. I think it's quite an interesting question whether that's true. If we're in that world, then we're just going to see machine learning continue to grow, so we would not be in a bubble; we would be at the beginning of a ramp up to spending some substantial fraction of GDP on machine learning. That's one possibility. The other possibility is that some applications are going to work well, so maybe we'll get some simpler applications working well, which could easily have impacts in the hundreds of billions or trillions of dollars, but things are going to dry up long before they get to human level. I think that seems quite conceivable.
I think it's a little bit more likely than not that at some point things pull back; it's somewhat less than 50% that the current wave of enthusiasm is just going to continue going up until we build human-level AI. But I also think that's kind of plausible. I think people really want to call bubbles, in a way that results in a lot of irrationality. I think Scott Sumner writes about this a lot, and I mostly agree with his take: when enthusiasm about something gets really high, that doesn't mean it's guaranteed to continue going up. It can just be a bet that there's a one-third chance, or a one-half chance, that it's going to continue going up. People are really happy to be self-satisfied after calling a bubble, after calling a level of enthusiasm unjustified. Sometimes they're right ex ante, and the fact that these calls are sometimes right makes it a lot more attractive to take this position. But I think a lot of the time, ex post, it was fine to say this was a bubble, while ex ante it was worth investing a bunch on the possibility that something is really, really important. I think that's kind of where we're at. I think the arguments people are making that deep learning is doomed are mostly pretty weak, for example because they're comparing deep learning to human intelligence, which is not the way to run this extrapolation. The way to run the extrapolation is to think about how tiny existing models are compared to the brain, think about, on the model where we'll be able to do a brain in 10 or 20 years, what we should be able to do now, and actually make that comparison, instead of trying to say, look at all these tasks humans can do.
What kinds of things should people do before we have artificial general intelligence, in order
to, I guess, protect themselves financially if they're potentially going to lose their jobs? Is there really anything meaningful people can do to shield themselves from potentially negative effects?
If the world continues to go well, so if all that happens is that we build AI and it just works the way it would in an efficient-market world, with no crazy turbulence, then the main change is this: currently roughly two-thirds of GDP gets paid out as labor income, and if you have a transition to human labor being obsolete, that falls to roughly zero, and all of GDP is paid out as returns on capital. So from the perspective of a normal person, you either want to be benefiting from capital indirectly, like living in a state that uses capital to fund redistribution, or you just want to have some savings. There's a question of how you'd want to invest. The market is not really anticipating AI being a huge thing over 10 or 20 years, so if you thought this was pretty likely, you might want to further hedge by taking a bet against the market and investing in stuff that's going to be extra valuable in those cases. I think the very naive guesses are mostly not crazy guesses: investing more in tech companies, for instance. I'm pretty optimistic about investing in semiconductor companies; chip companies seem reasonable. A bunch of stuff that's complementary to AI is going to become valuable, like natural resources. Again, in an efficient-market world, the price of natural resources is one of the main things that benefits as you make human labor really cheap: you just become limited on resources. People who own Amazon presumably benefit a huge amount, as do people who run logistics, people who run manufacturing, etc. I think generally just owning capital seems pretty good.
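The shift from labor income to returns on capital described above can be made concrete with a little arithmetic. This is a toy model, not anything from the episode: it assumes total output splits cleanly into labor income and returns on capital, uses the roughly two-thirds labor share mentioned above for today, and treats a household's share of all labor and of all capital as hypothetical inputs.

```python
from typing import Tuple


def income_split(gdp: float, labor_share: float) -> Tuple[float, float]:
    """Split total output into (labor income, returns on capital)."""
    labor_income = gdp * labor_share
    return labor_income, gdp - labor_income


def household_income(gdp: float, labor_share: float,
                     share_of_labor: float, share_of_capital: float) -> float:
    """Income of a household that supplies `share_of_labor` of all labor and
    owns `share_of_capital` of all capital (both hypothetical inputs)."""
    labor_income, capital_income = income_split(gdp, labor_share)
    return share_of_labor * labor_income + share_of_capital * capital_income


if __name__ == "__main__":
    gdp = 100.0  # arbitrary units
    # Hypothetical household: supplies 1% of all labor, owns 0.1% of all capital.
    today = household_income(gdp, 2 / 3, 0.01, 0.001)  # labor share about two-thirds
    after = household_income(gdp, 0.0, 0.01, 0.001)     # labor share about zero
    print(f"today: {today:.2f}, after: {after:.2f}")     # today: 0.70, after: 0.10
```

With these invented numbers the household's income falls from 0.70 to 0.10 units when the labor share goes to zero, unless total output grows enough to offset it, which is why the answer above points to owning capital, directly or through redistribution.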
Unfortunately, right now is not a great time to be investing, but I still think that's not a dominant consideration when determining how much you should save.
You say it's bad just because the stock market in general looks overvalued based on price-to-earnings ratios?
Yeah. It's hard to know what overvalued means exactly, but it seems reasonable to think of it in terms of: if you buy a dollar of stocks, how much earnings are there to go around for that dollar? And that's pretty unusually low. This might be how it is forever, I guess. If you have the kind of view I'm describing, if you think we're going to move to an economy that's growing extremely rapidly, then you have to bet that the rate of return on capital is going to go up. So in some sense you need to invest early, because you want to actually be owning physical assets, since that's where all the value is going to accrue. But it's a bummer to lock in relatively low rates of return. Suppose you make a loan, which is the way people normally hold capital: if you make a loan now at 1% real interest for 20 years, you're pretty bummed if people then develop AI and the economy is growing by 25 percent a year. Your 1%-a-year loan is looking pretty crappy.
Right, yeah.
And you're pretty unhappy holding it. Stocks are a little bit better than that, but it depends a lot; stocks still take a bit of a beating from that. So I think, in general, this is a consideration that undercuts the basic thing you would do if you expected AI, which would be to save more and own more capital if you can. I think that's undercut a little bit by the market being priced high, which could be because a bunch of people are doing that, but that's not why
it's happening. Prices aren't being bid up because everyone is reasoning in this way; they're being bid up just because of unrelated cyclical factors.
Let's talk now about some of the actual technical ideas you've had for how to make machine learning safer. One of those has been called iterated distillation and amplification, sometimes abbreviated as IDA. What is that idea in a nutshell?
I think the starting point is realizing that it currently seems easier to train an aligned AI system if you have access to some kind of overseer that's smarter than the AI you're trying to train. A lot of the traditional arguments about why alignment is really hard, why the problem might be intractably difficult, really push on the fact that you're trying to train, say, a superintelligence, and you're just a human. Similarly, if you look at existing techniques, at the kind of work people are currently doing in more mainstream alignment work, it's often implicitly predicated on the assumption that there's a human who can understand what the AI is doing, or a human who can behave approximately rationally, or a human who can evaluate how good the AI system's behavior is, or a human who can peer into what the AI system is thinking and make sense of that decision process. Sometimes this dependence is a little bit subtle, but it seems to me extremely common, even when people aren't acknowledging it explicitly, and a lot of these techniques are going to have a hard time scaling to domains where the AI is a lot smarter than the overseer who's training it. So, motivated by that observation, you could say: let's try to split the alignment problem into two parts. The first is to train an aligned AI, assuming you have an overseer smarter than that AI. The second is to actually produce an overseer who's smart enough to use that process, or
smart enough to train that AI. The idea in iterated amplification is to start from a weak AI. At the beginning of training you can use a human: a human is smarter than your AI, so they can train the system. As the AI acquires capabilities comparable to those of a human, the human can use the AI they've trained so far as an assistant, to help them act as a more competent overseer. So over the course of training, you have this AI that's getting more and more competent, and at every point in time the human uses several copies of the current AI as assistants to help them make smarter decisions. The hope is that this process both preserves alignment and allows the overseer to always be smarter than the AI they're trying to train. So the key steps in the analysis are, first, solving the problem I mentioned of training an AI given that you have a smarter overseer, and second, actually analyzing the behavior of the system consisting of a human plus several copies of the current AI acting as assistants to help the human make good decisions. In particular, as you move along in training, by the end the human's role becomes kind of minimal. If we imagine training a superintelligence, in that regime we're just asking: can you somehow put together several copies of your current AI to act as the overseer? You have this AI which, hopefully, remains aligned at each step, and you put together a few copies of it to act as an overseer for itself.
How is it that the AI copies and the human are going to be smarter than the AI they're trying to train? I mean, they're trying to make something that's smarter than them, right? But you're saying that in combination they can do a better job?
Yeah. So the key step there, the reason why it's plausible: let's suppose we're trying to train our AI, and it's at some level of capability. Suppose that
you imagine the AI gets to think for two seconds before each decision it makes. Then in this oversight process we have multiple copies of the AI, each thinking for two seconds. As an analogy to humans: you have one smart human making a decision, and they're being overseen by a team of 20 slightly less smart humans. And when I say slightly, I really mean slightly: they're one gradient step behind. So the hope, and this is an open question at the moment, is that if you have this team of 20 very slightly dumber aligned agents, they can divide up the work of overseeing this AI so that, even though each of them is a tiny bit dumber, by working together they can do it.
Okay. So the idea is that by throwing more processing power, or more instances, behind the somewhat dumber algorithm, they'll be smarter than the AI they're trying to train, and therefore able to keep it under control and give it good instructions, in concert with the human?
That's roughly right, yeah. In particular, when we ask what this team, which is now our overseer, actually has to do, there are maybe two big classes of tasks. One is that they're going to provide the incentives to the AI. Our AI is going to take some action, then our team of overseers is going to evaluate that action and decide how good it was, and that's going to be the reward signal we use to train our AI. In order for that to work, this team of twenty AIs needs to be able to make a better judgment about how good actions are than the AI they're trying to train, because with this process they can't really train an AI to make smarter decisions than that team would make. So there's this question: how good was action X?
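To make the shape of this training loop concrete, here is a deliberately tiny toy. It is an illustrative sketch, not the actual IDA algorithm or anything run at OpenAI: the "AI" is a lookup table (a hypothetical stand-in for a network), the task is summing a tuple of integers, the overseer is a human who splits a question in half and asks two copies of the current model, and "training" simply makes the model reproduce the overseer's answers. That is imitation rather than the reward-based training described above; both use the amplified team as the training signal. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Question = Tuple[int, ...]   # toy task: what is the sum of these integers?
Model = Dict[Question, int]  # hypothetical stand-in for a trained network


def model_answer(model: Model, q: Question) -> Optional[int]:
    """Fast, unassisted answer from the current model (None = can't answer)."""
    if len(q) <= 1:          # built-in skill: trivial questions only
        return q[0] if q else 0
    return model.get(q)


def amplified_answer(model: Model, q: Question) -> Optional[int]:
    """Overseer = human + two copies of the current model. The human splits
    the question in half and combines the copies' answers."""
    if len(q) <= 1:
        return model_answer(model, q)
    mid = len(q) // 2
    left = model_answer(model, q[:mid])
    right = model_answer(model, q[mid:])
    if left is None or right is None:
        return None
    return left + right


def distill(model: Model, questions: List[Question]) -> Model:
    """One training round: the new model learns to reproduce, in one step,
    the answers the amplified overseer built from copies of the old model."""
    new_model = dict(model)
    for q in questions:
        answer = amplified_answer(model, q)
        if answer is not None:
            new_model[q] = answer
    return new_model


def subquestions(q: Question) -> List[Question]:
    """Training set: every contiguous slice of q with at least two elements."""
    return [q[i:j] for i in range(len(q)) for j in range(i + 2, len(q) + 1)]


def iterate(q: Question, rounds: int) -> Tuple[Model, List[Optional[int]]]:
    """Run several amplify-then-distill rounds, recording the model's answer."""
    model: Model = {}
    history: List[Optional[int]] = []
    for _ in range(rounds):
        model = distill(model, subquestions(q))
        history.append(model_answer(model, q))
    return model, history


if __name__ == "__main__":
    _, history = iterate(tuple(range(1, 9)), rounds=3)
    print(history)  # [None, None, 36]
```

The model cannot answer the eight-number question after one or two rounds, but it can after three, and every answer it learned came from a team built out of copies of the previous, slightly weaker model. Whether real overseers can stay ahead like this, and stay aligned, is the open question in the passage.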
And you hope you can break down the problem of evaluating that action into a bunch of tasks, each of which is a little bit easier, and split those tasks up across this team. So that's one of the things the team needs to do: this evaluation, answering these evaluation questions. The second category, which we don't understand as well, is that they would like to ensure that the AI they're training reliably does the thing they're trying to incentivize it to do. One example of a technique you could use for this problem is called adversarial training. If I'm training some AI and I'm concerned that in some situations it might behave badly, I can try to synthesize situations, constructing handcrafted situations designed to look like ones where the AI might behave badly. Instead of running this AI just on real examples, I also run it on these synthetic examples designed to cause problems. In order to do that, I need to be able to do tasks like looking at the AI and figuring out the cases where it would perform poorly, and then evaluating, in such a case, whether it did something unacceptable. I think it's not crazy to think about this with an analogy to humans. Again, you have your team of twenty humans, and you have one slightly smarter human who, you're concerned, might under some conditions try to break this process: go rogue and do something bad. This team of twenty humans is responsible for interrogating that human and making sure there's no condition under which it's going to behave poorly, giving it test cases that, hopefully, the human being trained can't tell apart from the real situations it's going to encounter. That's an example of a technique you could use for this problem of reliability. There are other possible techniques you could use.
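The adversarial-training loop just described can be sketched in a few lines. This is a toy under loudly hypothetical assumptions: the AI is a one-parameter policy that "proceeds" on inputs below a threshold and "halts" otherwise; the overseer's judgment is a hard-coded rule that proceeding on any input of 90 or more is unacceptable; and "training" just lowers the threshold to the lowest failing input the adversary found. The naive training data is imagined to contain only safe inputs, so the policy starts out never halting.

```python
from typing import Callable, Iterable, List

Policy = Callable[[int], str]


def make_policy(threshold: int) -> Policy:
    """Toy policy: 'proceed' on inputs below the threshold, otherwise 'halt'."""
    return lambda x: "proceed" if x < threshold else "halt"


def unacceptable(x: int, action: str) -> bool:
    """Hypothetical overseer judgment: proceeding on inputs >= 90 is catastrophic."""
    return action == "proceed" and x >= 90


def find_failures(policy: Policy, synthetic_inputs: Iterable[int]) -> List[int]:
    """Adversary: run the policy on synthetic situations and keep every one
    the overseer judges unacceptable."""
    return [x for x in synthetic_inputs if unacceptable(x, policy(x))]


def adversarial_training(threshold: int, synthetic_inputs: Iterable[int],
                         max_rounds: int = 10) -> int:
    """Alternate attack and fix until the adversary finds nothing."""
    inputs = list(synthetic_inputs)  # reused every round
    for _ in range(max_rounds):
        failures = find_failures(make_policy(threshold), inputs)
        if not failures:
            break
        # "Train": move the threshold down to the lowest failing input found.
        threshold = min(failures)
    return threshold


if __name__ == "__main__":
    # Trained only on safe inputs, the policy never halts (threshold 100).
    print(adversarial_training(100, range(100)))        # 90: no failures left
    print(adversarial_training(100, range(0, 100, 7)))  # 91: x = 90 was never tried
```

The second call shows the weakness the passage is worried about: the trained policy is only as robust as the adversary's coverage. There the synthetic inputs skip x = 90, so a real failure survives training.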
Those are the two big tasks this team of AIs needs to be able to solve (it's a team of humans in the analogy). The team needs to be able to evaluate behavior, to produce good incentives to train the AI to do what the team wants it to do, and it needs to be able to do some evaluation for reliability, which is not something we currently understand well.
So at its core, you're going to try to get somewhat dumber AIs and humans together in a training process by which they figure out whether this smarter AI they're trying to develop is behaving the way they want, by designing particular scenarios to test whether that's the case. And even though they're not quite as smart, because you're throwing quite a lot of power behind the somewhat simpler task of just evaluating whether it's doing the right thing, you hope you'll be able to gradually scale up and not lose alignment at any particular point in time.
Yeah, that's right. I guess it's worth pointing out again that generating scenarios is one possible way of testing, of trying to get the system to robustly do the right thing. There are other possible approaches. You could also try an approach where you open up the brain of the AI you're trying to train, using something like the interpretability techniques people are currently working on, to understand how it's thinking, and say: ah, now that I understand how it's thinking, I see that here's the place where it's thinking in a way I wouldn't have wanted it to, and I can tell from that that it will fail in this scenario. Or I can just directly say: no, that was not a good way to be thinking about the problem, and penalize that. One of the major things this group is doing is just determining incentives for the AI they're training. So this team of slightly
dumber humans is constantly evaluating the AI on realistic examples, examples that appear in the real world, and saying how good its behavior was in this case or in that case, and the AI is being trained to maximize those evaluations.
Yeah, okay. So by incentives you mean: do we give it its reward, do we give it whatever it's been programmed to try to get?
Yeah. Formally, you would really be using gradient descent: we take our AI, we take the evaluation this team is providing, and then we modify the AI very slightly so that it gets a slightly higher evaluation, or outputs actions that have higher evaluations on average. In that setting, the AI you're starting with is exactly the same as the AIs on the team doing the oversight. But after you make this very small perturbation, the perturbed AI is hopefully very slightly smarter than the AIs on the team. The AI that's actually being evaluated is exactly as smart as the ones on the team; it's only as you consider these possible perturbations that you hope the perturbations are epsilon smarter. And that's how training would normally work: we'd have some evaluation, consider an AI, run it, perturb it to give slightly better performance, and repeat.
Okay. Someone emailed me about IDA, wanting me to ask you about it, and said: "The context here is that I and many others think that IDA is currently the most promising approach to solving the alignment problem, largely because it's the only real actual proposal anyone has made." Do you think that's right? And more generally, what's been the reaction to this general approach?
Yeah. I would say the current situation is this: I am very interested in really asking what solutions would look like as you scale them up. What is our actual game plan, what is
the actual endgame here? That's a question relatively few people are interested in, and so very few people are working on it. The Machine Intelligence Research Institute is very interested in that question, but they part ways with me by believing that the question is so obviously impossible that it's not worth thinking about directly; instead we should be trying to improve our understanding of the nature of rational agency. That's the reason they are not in the business of trying to produce concrete proposals: it just seems doomed to them, like patching holes in a thing that's fundamentally not going to work. Most people in the broader ML community, I would say, have taken a view that's more like: we don't really know how the system is going to work until we've built it, so it's not that valuable to think in advance about what the actual scheme is going to look like. So that's the difference there. I think that's also true for many safety researchers who are more traditional AI or ML researchers. They would more often say: look, I have a general plan; I'm not going to think in great detail about what that plan is going to look like, because I don't think that thinking is productive, but I'm going to try to vaguely explain the intuitions for why something like this could work. It happens to be the case that basically no one is engaged in the project of actually saying, here's what the scheme might look like, and aspiring to actually write down a scheme that could work. There are a few other groups also doing that, I guess. On the OpenAI safety team we also recently published this paper on safety via debate, which I think also has this form of being an actual candidate solution, or something aspiring to be a scalable solution to this problem. Geoffrey Irving was lead author on that; he's a colleague
on the OpenAI safety team. I think that's coming from a very similar place, and in some sense maybe it's a very similar proposal. I think it's very likely that either both of these proposals work or neither of them works, so in that sense they're not really totally independent proposals. They're really pushing on the same facts about the world: both of them are leveraging AI to help evaluate your AI. I think the other big category is work on inverse reinforcement learning, where people are attempting to invert human behavior and say: given what the human did, here's what the human wants, and given what the human wants, we can come up with better plans to get what the human wants. Maybe that approach can be scalable. I think the current state of affairs is that there are some very fundamental problems with making it work, with scaling it up, related to how you actually define what it is that the human wants: how do you relate human behavior to human preferences, given that humans aren't really the kind of creature that has a slot in the brain where we put the preferences? Unfortunately, I think we haven't made very much progress on what I would consider the core of the problem. I think that's related to people in that area not thinking of that as their current primary goal; that is, they're not really in the business of saying, we're going to try to write down something that's just going to work no matter how powerful the AI is. They're more in the business of saying: let's understand and clarify the nature of the problem, make some progress, try to get some intuition for what will allow us to make further progress, and figure out how to get ourselves in a position where, as AI improves, we'll be able to adapt to that well. I think it's not a crazy perspective. I think that's
how we come to be in this place where there are very, very few concrete proposals aspiring to be an actual scheme you could write down and then run with AI, and that would actually yield aligned AI. As for the overall reaction, I think there are two kinds of criticisms people have. One is that this problem seems just completely hopeless. There are a few reasons people think this iterated amplification approach is completely hopeless, and they can normally be divided roughly into two. The first is thinking that organizing a team of twenty AIs to be aligned and smarter than the individual AIs already subsumes the entire alignment problem: in order to do that, you would need to understand how to solve alignment in such a deep way that, if you understood it, there'd be no need to bother with any of the other machinery. The second common concern is that this robustness problem is just impossibly difficult. In this iterated amplification scheme, as we produce a new AI, not only do we need to incentivize it to do well on the training distribution, we also need to restrict it from behaving really badly off the training distribution. There are a bunch of possible approaches to that which people are currently exploring in the machine learning community. I think the current situation is that we don't see a fundamental reason it's impossible, but it currently looks really hard, and so many people are suspicious that the problem might be impossible. So that's one kind of response: maybe iterated amplification cannot be made to work. The other kind of response is that AI safety is easy enough that we don't need any of this machinery. That is, I've described this procedure for trying to oversee AI systems that are significantly smarter than humans, and on this perspective many of the problems are only problems when you
want things to be scalable to very, very smart AI systems. You might think instead: look, we just want to build an AI that can take one, quote, "pivotal act". That's an expression people sometimes use for an action an AI could take that would substantially improve our situation with respect to the alignment problem. So on that view, we want to build an AI that can safely take one pivotal act, which doesn't require being radically smarter than a human or taking actions that are not understandable to a human, and so we should not really be focusing on, or thinking that much about, techniques that work in this weird extreme regime. I guess even people in the broader ML community, who don't necessarily buy into the framework of taking a pivotal act, would still say: look, you're worrying about a problem which is quite distant. It's pretty likely that, for one reason or another, that problem is going to be very easy by the time we get there, or that one of these other approaches we can identify is just going to turn out to work fine. I think both of those reactions are quite common. I think there's also a reasonably big crowd of people who are like, yeah, I'm really interested; they come from a similar perspective to me, where they really want a concrete proposal that they can actually see how it could work. Of those people, many of the ones who aren't incredibly pessimistic about this proposal are pretty optimistic about iterated amplification or debate or something along those lines.
That's a great answer. I think it's really creditable that you actually try to put out ideas for how we could deal with this; as you say, I've seen very few other people actually try to do that. People can read those ideas for themselves on your AI Alignment blog on Medium. You mentioned there another approach that people have
been talking about recently, which is debate as a way of aligning AI. You also mentioned inverse reinforcement learning, but we discussed that in the episode last year with Dario Amodei, so we'll skip that one. Can you describe the approach in the debate paper, which sounds like it's somewhat similar to IDA?
Yeah. So the idea is that we're interested in training AI systems to make decisions that are, in some respects, too complicated for a human to understand. It's worth pointing out that this problem can appear long before AI is broadly human-level, because AI capabilities are very uneven: an AI can have an understanding of a domain that's way beyond human understanding of that domain even while being subhuman in many respects. We want to train this AI to make decisions that are too complex for a human to understand, and we're wondering how you get a training signal for such an AI. One approach people often take is to pick some actual consequence in the world, some simple consequence you can optimize. Say you're running a company: I don't care how you're making decisions about that company; all I care about is that they lead to the company having high profit. We're interested in moving away from that, because I think there are serious concerns with it from a safety perspective. We want to move more towards the regime where, instead of concluding "yes, this decision had good consequences" without understanding why, we're evaluating a proposed decision and saying "yeah, we understand that's a good decision, so we're going to give it a high reward, because we understand why it's good". That approach has a problem, though: if an AI comes to you and says, "I would like to design the particle accelerator this way, because...", and then makes an inscrutable argument about physics, you're faced with a tough choice. You can either sign off on that decision and just see if it has good consequences, or you can say, no, don't do that,
because I don't understand it. But then you're permanently foreclosing some large space of possible things AI could do. So instead, the proposal is that we're going to have two AIs. One AI is going to make a proposal, and we can't directly evaluate that proposal: it depends on a bunch of complicated facts we don't necessarily understand. It's going to make some complicated argument about the economy in order to justify it, and we couldn't actually evaluate that argument. But we introduce this adversarial agent who can explain to us why the proposal that was made is bad. Then, if this critique has a flaw, the original agent can say, "No, that's not a valid critique, because...", and point out the flaw, and then the critic can say, "No, it was valid, because...". So they can go back and forth in this way. Then you can kind of implicitly explore an exponentially large space of considerations, because you've given the critic the option to pick any line of argument they want in order to attack the proposal: if the critic is not able to win, that suggests to you that every possible line of argument would have been unsuccessful, that every possible line of argument would still have left you thinking the proposal was a good one. It's not clear if you can actually construct this. So now we have some complicated question, and AIs are proposing actions to us. We would like to set up the debate such that the best action will actually win: if two AIs propose actions, and one of them is proposing an action that's actually better, then it will be able to win a debate in which it establishes that its action is better. I think there are some plausibility arguments, like the one I just gave about exploring an exponentially large space of considerations, that this might be possible in cases where a human couldn't have any idea about the task itself or about directly answering it.
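The back-and-forth described above, where the critic keeps zooming in on one disputed point, can be illustrated with a toy in which the honest side is guaranteed to win, which is exactly what is not known for real debates. All of the following is hypothetical: the question is the sum of a long list; the proponent defends a claimed total by splitting it into two half-totals that must add up; an honest opponent, who can compute true sums, challenges whichever half is wrong; and the judge only ever checks one addition per round, plus a single list element at the end.

```python
from typing import Callable, List, Tuple

# split(xs, start, mid, end, total) -> (claimed left half-sum, claimed right half-sum)
Splitter = Callable[[List[int], int, int, int, int], Tuple[int, int]]


def honest_split(xs: List[int], start: int, mid: int, end: int,
                 total: int) -> Tuple[int, int]:
    """An honest proponent reports the true sums of both halves."""
    return sum(xs[start:mid]), sum(xs[mid:end])


def lying_split(xs: List[int], start: int, mid: int, end: int,
                total: int) -> Tuple[int, int]:
    """A dishonest proponent keeps its halves consistent with its (false)
    total by pushing all of the error into the right half."""
    left = sum(xs[start:mid])
    return left, total - left


def debate(xs: List[int], claimed_total: int, split: Splitter) -> Tuple[bool, int]:
    """Return (does the claim survive?, number of checks the judge performed)."""
    start, end, total = 0, len(xs), claimed_total
    checks = 0
    while end - start > 1:
        mid = (start + end) // 2
        left, right = split(xs, start, mid, end, total)
        checks += 1                       # judge: does left + right == total?
        if left + right != total:
            return False, checks
        # Honest opponent zooms in on a half whose claimed sum is wrong.
        if left != sum(xs[start:mid]):
            end, total = mid, left
        elif right != sum(xs[mid:end]):
            start, total = mid, right
        else:
            return True, checks           # opponent cannot find a flaw
    checks += 1                           # judge inspects one element directly
    return xs[start] == total, checks


if __name__ == "__main__":
    xs = list(range(1, 17))               # true sum is 136
    print(debate(xs, 136, honest_split))  # (True, 1)
    print(debate(xs, 137, lying_split))   # (False, 5)
```

The false claim is exposed after five judge checks on a 16-element list (one per halving, plus the final element), so the judge's work grows roughly with the logarithm of the problem size rather than with the problem itself. The honest side's advantage here comes from arithmetic being checkable one small step at a time; whether real debates, judged by humans on messy questions, have that property is not yet known.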
the question. It's a very open question exactly how powerful debate is. That is, suppose we set up a debate in the best possible way: we have some human judge of the debate who's evaluating the claims and counterclaims, we give them optimal training and optimal advice, and then we have two very powerful agents debate in this way. We'd like it to be the case that the optimal strategy in that debate is being honest: actually telling the truth, providing valid arguments for it, and responding to counterarguments in a valid way. I think we don't know if that's the case. But figuring out if that's the case, understanding in what cases we're able to run such debates so that they converge to truth, understanding how to set them up so they converge to truth, and so on, does give a plausible way of training very powerful AI assistants.

So how analogous is this approach to a case where, say, a person like me is trying to judge a difficult scientific issue? I wouldn't be capable of doing the original research and figuring out the truth myself, but if there were scientists debating back and forth, and one of them was perhaps trying to be misleading while another was being truthful, the hope is that I would be able to figure out which one was telling the truth, because I can at least evaluate the debate, even if I couldn't produce the arguments myself.

Yes, I think the situation is pretty analogous: there are two human experts with lots of understanding you lack, and you're trying to understand the truth. You hope that if one of those experts is making a claim that is true, then by zooming in on one consideration after another you could find out that it's true: you eventually become very skeptical of the counterarguments, or the truthful expert was able to undermine all the counterarguments that were offered, and so on. It's definitely not an obvious claim. It's not obvious in the context of
human discussions. As a society, empirically, there aren't great examples of covering really big gaps in expertise. It's often the case that two people with expertise in an area can have a debate in a way that's convincing to someone with slightly less expertise, but when there are really large gaps, we don't have a very good record of doing that kind of thing successfully. I'd say there's more hope that this is possible than that a human could just directly evaluate some proposal produced by a sophisticated AI system, but it's still very much an open question whether this kind of thing can actually work. One way you could try to assess that would be to get fairly serious about running experiments: take people with considerable expertise in an area and have them hold a debate arbitrated by someone with less expertise.

What do you think is the biggest or most concerning criticism of AI safety via debate as it stands?

Personally, I think the worst problem is just this basic question: do debates tend to favor accurate answers, or do they tend to favor answers that are easy to defend for reasons other than their accuracy? There are a bunch of reasons debate might favor an answer other than its being accurate. The one that really leaps to people's minds is that the judge is just a human, and humans have all sorts of biases and inconsistencies. That's one reason debate could favor answers other than the accurate one. I'm personally more concerned about an even more basic question: setting aside all human biases and all the ways in which humans fail to reason well, does the structure of debate tend to promote truth? Does it tend to be the case that there's some way to argue for the accurate position, even if the thing you're debating is really, really complex compared to what the human could
understand?

It seems like debate among humans is way better than random, anyway.

I agree that humans are clearly able, in some cases, to get much better answers this way than we get on our own. If I get to observe two experts debate a subject, even if one of them is actively trying to mislead me, I can arrive at a better conclusion than I could have if I just wasn't able to listen to their expertise, or if I were only given a single expert whose incentive was just to look good. But I think the example of debates among humans makes it very unclear whether this procedure can be scaled arbitrarily far. An example you might think of: consider a smart person who knows a lot about reasoning and has practiced judging debates a lot, but doesn't know any calculus, and they're now judging a debate between two quantum physicists about how to interpret the results of some experiment at a recent particle accelerator. Just imagining that process, I can see how it could work; I can imagine it working. But it's an incredibly intimidating prospect. This person is not going to understand anything about the subject. Over the course of the debate, there's no way they can form in their head a reasonable model of calculus, or of quantum mechanics, or of the Standard Model. And yet you hope that somehow the way they arbitrate this debate can implicitly answer extremely complex questions that depend on all those areas. So I think this is a kind of test you can do empirically: take a question, a very smart person who's been trained to judge such debates, and two people with a ton of expertise in an area the judge has never thought about, one of them trying to tell the truth and one trying to mislead. Is it the case empirically that humans can arbitrate such debates and
that the best way to win such a debate is actually to provide true facts about the domain to the human? If that's the case, I think it's a very interesting fact, not just for the purpose of training AI but in general. I think it's a really important question about the world: are there norms of debate that allow you to consistently arrive at the truth in domains where the arbitrator doesn't understand what's true? That's a question that's relevant to a ton of domains. This version of the question is a little bit distinctive in some respects, mostly because we are free to set things up in whatever way is maximally convenient. So it's asking: under the best possible conditions, can debate be conducive to truth? Most debates in current society, by contrast, occur under pretty highly suboptimal conditions: very limited time, bad incentives on the part of the judge, and a judge sampled from some population who doesn't have a lot of time to think about how to judge such debates well and hasn't thought a lot about how to structure them to lead to truth. So I think most debates in the real world occur under pretty pessimistic conditions. But just understanding when debate works, or whether the equilibrium of a debate is the truth, is something I would consider a really fundamental and interesting question, completely independent of AI. And I think it's now also a particularly important question, because it really is one of the most plausible strategies for training very powerful AIs to help us actually arrive at good advice or good conclusions.

Are there other important pros and cons of this approach that are worth mentioning?

I think there's definitely a lot that could be said about it. There are a bunch of other issues
that come up when you actually start trying to do machine learning. When you try to train agents to play this kind of game, there are lots of ways that can be hard as a machine learning problem, and you could have lots of concerns in particular about the dynamics of this game. Some people might not be happy that you're training an AI to be really persuasive to people; you might be concerned that this makes some kinds of failure modes crop up in more subtle ways or be more problematic. But I really think the main thing is just: is it the case that a sufficiently sophisticated judge will be able to... You know, every judge defines a different game: convincing me is a different game from convincing you. I think it's clear that for weak enough judges this game isn't particularly truth-conducive; there's no reason the honest player would have an advantage. The hope is that there's some level of sufficiently strong judge above which you converge, over longer and longer debates, to more accurate claims. It's unclear even so. The first question is whether there is such a threshold, and the second question is whether humans are actually above that threshold, that is, whether honest strategies will actually win if we have humans judge such debates.

What kind of people do you need to pursue this research? Are there any differences compared with other lines of research?

Again, there's a bunch of very similar questions that come up for amplification and debate, and I think different questions require different kinds of skills and different backgrounds. For both amplification and debate, there's this more conceptual question (I don't know if conceptual is the right word): it's a fact both about the structure of argument and about the actual way humans make decisions. Can humans arbitrate these debates in domains where they lack expertise, or, in the
amplification case, can you have teams addressing some issue where no individual can understand the big picture? There's a bunch of different angles you could take on that question. You could take a more philosophical angle and ask what is going on there, why we should expect this to be true, or what the cases are that might be really hard. You could also just run experiments with people, which seems relatively promising but obviously involves a different set of skills. Or you could try to study it in the context of machine learning and go straight for training. You might say that testing this with humans would require very large numbers of humans, so maybe the easiest way to test it is to be in the regime where we can train machines to effectively do things that are much, much more expensive than we could afford to do with humans. So you could imagine approaching it from a more philosophical perspective, from something more like cognitive science, from maybe not even an academic perspective at all (just putting together a bunch of humans and understanding how those groups behave), or from a more machine learning perspective.

What's been the reception from other groups to this debate approach?

There are many different groups, and different answers for different groups. I would say that for many people, the nature of the alignment problem is very unclear when they first hear it stated in an abstract way. So for a lot of people it's been very helpful to get a clearer sense of what the problem is that we're trying to solve. When you show them this proposal, people both understand why debate is better than just giving an answer and having a human evaluate it, and they can also see very clearly why there's a difficulty: it's not obvious that the outcome of the debate is in fact producing the
right answer. So from that perspective it's been extremely helpful: a lot of people have really been able to get much more purchase on understanding what the difficulties are and what we're trying to do. For people who are more on the ML side, it's again been very helpful for having them understand what we're trying to do. But I think the ML community is really very focused on a certain kind of implementation and on actually building the thing. So that community mostly says, "That's a very interesting research direction," and then their response is to wait until things either happen or don't happen, until we've actually built systems that embody those principles to do something we wouldn't have been able to do without that idea.

So if we can use this approach to go from, say, 60% accuracy to 70% or 80% accuracy, how useful is that? Do we need to be able to judge these things correctly almost all the time, or is it more that the more often humans can make the right call, the better?

Certainly if you just had a judge who is correct except that 40% of the time they err randomly, that would be totally fine, because that's going to average out; it's not a problem at all. What you really care about is to what extent there are systematic biases in these judgments, so that the judge just consistently gives the wrong answer when the answer depends on certain considerations, or in certain domains. From that perspective, I guess the question is what class of problems you can successfully resolve with this technique. If you push that frontier of problems a little bit further, you can solve a few more problems now than you could before.

Are you happy to...

There are kind of two attitudes you could have on this. One, I guess the thing I would really like, is a solution that just
works, in the sense that we have some principled reason to think it works, it works empirically, and as we scale up our machine learning systems it works better and better, and we don't expect that to break down. That would be really great, and from a theoretical perspective that's kind of what we'd like. There's a second perspective you could have, which is just that there's a set of important problems we want to apply machine learning systems to. As we deploy ML systems, we think the world might change faster or become more complex in certain respects, and what we really care about is whether we can apply machine learning to help us make sense of that kind of world or steer it in a good direction. From that perspective, there's this list of tasks we're interested in, and the more tasks we can apply ML to, the better positioned we'll be to cope with possible disruption caused by ML. So from that perspective, you're just happy every time you expand the frontier of tasks you're able to solve effectively. I also take that pretty seriously: if we could just push the set of tasks we're able to solve a little bit, I think that would improve our chances of coping with things well, a little bit. But my main focus, I think, while we are further away and have to think about things more conceptually or more theoretically, is that it's better to focus on having a really solid solution that we think will work all the time. As we get closer, it becomes more interesting to say, "Great, now we see these particular problems we want to solve; let's see if we can push our techniques a little bit so that ML systems can help us solve those problems."

Do you think it's possible there's an advantage to whoever's trying to be deceptive in these cases, that in fact it's easier for the person who's trying to mislead a judge
because they can choose from a wide range of possible claims they could make, whereas the agent that's trying to communicate the truth can only make one claim, which is the true one?

I guess a few points. A first, preliminary point is that in general there wouldn't be fixed roles for the two agents; there wouldn't be one assigned to tell the truth and one assigned to lie. Instead, they would both just be arguing whatever they thought would be most likely to be judged as honest and helpful. So in a world where it worked like that, neither participant in the debate would be trying to say anything true; both of them would be arguing for some garbage, if we were in this unfortunate situation. Then, in terms of the actual question: you could imagine there's this giant space of things you could argue for, and some tiny subset of them are the things we would actually regard, on reflection, as the most truthful. That's a very, very tiny subset of the possible claims, and there are a ton of other things that differ between different claims besides how actually useful and truthful they are. So a priori it's a very surprising, or very special, claim to say that the very best strategy among all these strategies is the one that's most truthful and helpful. Your first guess, if you didn't know anything about the domain, would be that some other property matters: maybe how nice it sounds is very useful, so you want to pick the thing that sounds nicest, or the thing that has the slickest sound bite in its favor, or something like that. But I am reasonably optimistic that if, say, a human judge is careful, they can judge well enough. I'd say that if you're a weak judge, this process can't really
get off the ground: you're not able to correlate your judgments with the truth at all. As you get to be a stronger judge, you hope that not only can you start to answer some questions, you can bootstrap up to answering more and more complex questions. That is, you could say: if I just guess something off the top of my head, that has some level of correlation with the truth; in easy enough cases it's going to be truthful. Then if I have a short debate that bottoms out with me guessing something off the top of my head, that can be a little bit more conducive to the truth. And if I have a long debate, after which I can have a short debate to decide which side wins, I think that's more likely still to be conducive to truth. So hopefully you have a kind of limiting behavior: as you think longer and longer, the class of cases in which truthfulness becomes the optimal strategy grows. But I think it's not obvious.

What's the best source for people who want to learn more about this approach?

There's a paper on arXiv, and I think there's a blog post that came out after that which is perhaps more extensive, but I think the paper is probably the best thing to look at. It's a paper called "AI safety via debate." It covers and raises a lot of considerations, discusses possible problems, discusses how it compares to amplification, and things like that. It presents some very simple toy experiments to show a little bit about how this might work in the context of machine learning. It doesn't present a convincing example of a system that does something interesting using debate; that's what we're currently working on, so a reader who's looking for that should maybe come back in six months. I think if you want to understand why we're interested in the idea or
what is basically going on, then the paper's great.

What would prevent us from implementing either of these strategies today? What advances do we need to actually be able to put them into practice?

I think, depending on your perspective, either unfortunately or fortunately, there's really a ton of stuff that needs to be done. One category is just building up the basic engineering competence to make these things work at scale. Running this process is kind of like training an AI. Let's consider the debate case, which I think is fairly similar in technical requirements but maybe a bit easier to talk about. We understand a lot about how to train AIs to play games well, because that's a thing we've been trying to do a lot. This game has many differences from the games people normally train AIs to play. For example, it is arbitrated by a human, and queries to a human judge are incredibly expensive. That presents you with a ton of problems: organizing the collection of this data, and using approximations. There's this whole family of approximations you're going to have to use in order to actually train these AIs to play this game well. You can't have a human actually make the evaluation every time they play the game. You need to be training models to approximate humans, you need to be using less trusted evaluations, and you need to be learning cleverly from passive data rather than actually allowing them to query the judge. So technically running this project at scale is hard for a bunch of the reasons that AI is hard, and also hard for some additional reasons distinctive to the role of humans in these kinds of proposals. It's also hard as a game, I guess, because it has some features that games don't normally have. We're not used to thinking of games with... there are other
technical differences, beyond the involvement of humans, that make these kinds of hard engineering problems, and some of those are things I'm currently working on: just trying to understand them better and trying to build up the actual engineering expertise to be ready to make these things work at very large scale. That's one class of problems. The second class of problems is just figuring out... I think there are maybe two things you could want. One is that you want to be able to actually apply these schemes: you want to be able to actually run such debates and use them to train a powerful AI. But you also want to understand much more than we currently understand about whether that's actually going to work well. So in some sense, even if there were nothing stopping us from running this kind of training procedure right now, we'd have to do a lot of work to understand whether we're comfortable with that. Do we think that's good, or do we think we should use some other approach, or maybe try harder to avoid deploying AI? That's a huge cluster of questions. Some of them are empirical questions about how humans think about things: what happens in actual debates involving humans? What happens if you actually take 20 humans and have them coordinate in the amplification setting? It also depends on hard philosophical questions, like the one I mentioned earlier: what should a superintelligent AI do? If you had a formal condition for what it should do, then things would probably be a lot easier. Our current position is that we don't know, so in addition to solving that problem, we're going to be defining that problem; "should" is a tricky word. So that's a second category of difficulties. There's a third big category of difficulties, and this third category is maybe something you just wait on: current AI is not sophisticated enough to, say, run interesting debates. That is, if you imagine the kind
of debate between humans that's interesting and truth-promoting, that involves a very complicated learning problem the debaters have to solve, and I think right now that problem is kind of just at the limits of our abilities. You could imagine training that kind of AI in some simple settings. So one option would just be to wait until AI improves, and say we're going to study these techniques in simpler cases and only apply them with the real messiness of human cognition once AI is better. There's also the opportunity to try to push safety out as far as one can go, which means actually starting to engage with the messiness of human cognition now. And to be clear, that's in parallel with the second category I suggested, the philosophical difficulties and asking whether this is actually a good scheme, which is totally going to involve engaging a ton with humans even today: actually running debates, and actually doing the kind of decomposition process that underlies amplification. So maybe those are the three main categories of difficulty that I see. All of them seem very important. My current take is probably that the most important ones are about figuring out whether this is a good idea, rather than actual obstructions to running the scheme. I think it's quite realistic to be, relatively soon, in a place where you could use this procedure to train a powerful AI, and the hard part is just getting to the point where we actually believe that's a good idea, or have actually figured out whether it's a good idea. And not just figuring that out: it's also modifying these procedures so that they actually are a good idea.

Yeah, that makes a lot more sense now. Okay, let's push on to a different line of research you're doing, into prosaic AI alignment. You've got a series of posts about this on ai-alignment.com. What's the argument you're making, and what is prosaic AI?
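Christiano's answer, which follows, defines prosaic AI by the basic deep learning recipe: define an objective, define a really broad class of models, and optimize over that class to find something that scores well according to the objective. A deliberately tiny sketch of that three-part recipe in Python, assuming a two-parameter linear model class, a mean-squared-error objective, made-up data on the line y = 2x + 1, and plain gradient descent. All of these are illustrative stand-ins for giant neural nets and real training objectives, not anything OpenAI or the conversation specifies:

```python
from typing import List, Tuple

# Hypothetical toy data (an illustrative assumption): 21 points on y = 2x + 1.
DATA: List[Tuple[float, float]] = [(i / 10, 2 * (i / 10) + 1) for i in range(-10, 11)]


def objective(w: float, b: float) -> float:
    """The objective: mean squared error of the model y_hat = w * x + b on DATA."""
    return sum((w * x + b - y) ** 2 for x, y in DATA) / len(DATA)


def gradient(w: float, b: float) -> Tuple[float, float]:
    """Analytic gradient of the objective with respect to the parameters (w, b)."""
    n = len(DATA)
    dw = sum(2 * (w * x + b - y) * x for x, y in DATA) / n
    db = sum(2 * (w * x + b - y) for x, y in DATA) / n
    return dw, db


def gradient_descent(steps: int = 2000, lr: float = 0.1) -> Tuple[float, float]:
    """The optimization: search the model class for parameters that score well.

    The model class is implicit here: every line y = w * x + b is a candidate.
    """
    w, b = 0.0, 0.0  # start from an arbitrary member of the model class
    for _ in range(steps):
        dw, db = gradient(w, b)
        w -= lr * dw  # step each parameter downhill on the objective
        b -= lr * db
    return w, b


if __name__ == "__main__":
    w, b = gradient_descent()
    print(f"w = {w:.4f}, b = {b:.4f}, objective = {objective(w, b):.2e}")
```

Running the script should print parameters very close to w = 2 and b = 1, the line that generated the toy data. The prosaic AI hypothetical asks what happens if this same pattern (objective, broad model class, optimizer) is scaled up by many orders of magnitude, rather than replaced by some fundamentally new technique.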
So I would describe this as a motivating goal for research, or a statement of what we ought to be trying to do as researchers working on alignment. Roughly, what I mean by prosaic AI is AI that doesn't involve any unknown unknowns, or any fundamental surprises about the nature of intelligence. We can look at existing ML systems and, whether or not I think this is likely, ask what would happen if you took these ideas and scaled them up to produce something like sophisticated behavior or human-level intelligence. And again, whether or not that's likely, we can understand what those systems would look like much better than we can understand what other kinds of systems would look like, just because they would be very analogous to the kinds of systems we can build today. In particular, if the thing we're scaling up is something like existing techniques in deep learning, that involves defining an objective, defining a really broad class of models (really giant neural nets, or some complicated model involving attention and internal cognitive workspaces), and then just optimizing over that class to find something that scores well according to the objective. That's the basic technique, and you could ask what would happen if it turned out that technique could be scaled up to produce powerful AI. That's what I mean by prosaic AI. The task would then be: supposing you live in that world, supposing we're able to do that kind of scale-up, can we design techniques that allow us to use that AI for good, or allow us to use that AI to do what we actually want, given that we're assuming that AI can be used to have some really big, transformative impact on the world? There are a few reasons you might think this is a reasonable goal for research. Maybe one is that it's a very concrete model of what AI
might look like, so it's relatively easy to actually work on, instead of being in the dark and having to speculate about what kinds of changes might occur in the field. A second reason is that even if many more techniques are involved in AI, it seems quite likely that doing gradient descent over rich model classes is going to be one of several techniques, so if you don't understand how to use that technique safely, it's pretty likely you're going to have a hard time. Maybe a third reason is that I think there is some prospect that existing techniques will go further than people guess, and that case is particularly important from the perspective of alignment, because in that case people would, by hypothesis, be caught a little bit by surprise. There's not that much intervening time to do more work between now and then. So in general I would advocate for a policy of looking at the techniques we understand currently and trying to understand how to use those techniques safely. Then, once you've really solved that problem, once we understand how to do gradient descent in a way that produces safe AI, you can go on and look toward future techniques that might appear. Ideally, for each of the techniques that might play a role in building AI, you'd have some analogous safe version of that technique which doesn't introduce problems with alignment but is roughly equally useful.

So I guess the people who wouldn't be keen on this approach would be those who are confident that current methods are not going to lead to very high levels of general intelligence, and so they expect that the techniques you're developing now just won't be usable, because things could be so different.

Yeah, I guess there are two categories of people who might be super skeptical of this as a goal. One would be, as you said, people
who just don't believe that existing techniques are going to go that far, or don't believe they're going to play an important role in powerful AI systems. A second would be those who think that's plausible but that the project is just doomed: that there is going to be no way to produce an analog of existing techniques that would be aligned, even if those techniques could in fact play a role in AI systems. I think both of those are reasonably common perspectives.

In a minute we'll talk about MIRI and their view, which I guess is perhaps a bit of a combination of the two.

Yeah, they're the strongest proponents of the second view: that we're super doomed in a world where AI looks anything like existing systems.

Can you lay out the reasons both for and against thinking that current techniques in machine learning can lead to general intelligence?

One simple point in favor is that we do believe that if you took existing techniques and ran them with enough computing resources (there's some anthropic weirdness and so on), that would produce general intelligence, based on observing humans, who were effectively produced by the same techniques. So we do think that with enough compute it would work. If you were to run a really naive analogy with the process of evolution, you might think that if you scaled up existing ML experiments by something like 20 orders of magnitude, then you would certainly get general intelligence. So that's one basic point: these techniques would probably work at a large enough scale. Then it just becomes a question of what that scale is, that is, how much compute you need before you can do something like this to produce human-level intelligence. So the arguments in favor become quantitative arguments about why fewer orders of magnitude might be necessary. That could be an argument
that talks about the efficiency of our techniques compared to the efficiency of evolution, or examines ways in which evolution probably uses more compute than we'd need. It includes arguments about things like computing hardware: how much of those 20 orders of magnitude will we just be able to close by spending more money and building faster computers? Twenty orders of magnitude sounds like a lot, but actually we've covered more than twenty-two orders of magnitude so far, and we will cover a significant fraction of those over the coming decade. You can also try to run arguments by analogy: look at how much compute existing systems take to train and try to understand that, so you ask, based on our experience so far, how much compute you think will be needed. That's probably the most important class of arguments in favor. Then there are other qualitative arguments: there are lots of tasks we're able to do, so you'd probably want to look at what tasks we have succeeded or failed at and try to fit those into that quantitative picture to make sense of it. I think it's not insane to say that existing systems have possibly reached the level of sophistication of insects. We're able to take a very brute-force approach, doing search over neural nets, and get behavior that's... this is totally unclear, but I think it's plausible that existing behavior is as sophisticated as that of insects, and if you thought that, I think it would constitute an argument in favor.

And I guess the arguments against?

Probably the most salient argument against is just that if we look at the range of tasks humans are able to accomplish, we have some intuitive sense of how quickly machines are becoming able to do more and more of those tasks, and I think many people would look at that rate of progress and say: if you were to extrapolate that rate, it's just going to take a very, very long time before
machines are able to do that many tasks. I think a lot of this is just that people extrapolate things in very different ways. Some people would look at being able to do the tasks an insect can do and say, "Wow, insects have reasonably big brains. On a scale from nothing to human, we've come some substantial fraction of the way, and we're perhaps plausibly going to get there just by scaling this up." Other people would look at what insects do and say, "Look, insects exhibit almost none of the interesting properties of reasoning. You've captured some very tiny fraction of that, and presumably it's going to be a really long time before we're able to capture even a small fraction of interesting human cognition."
What are the aspects of cognition that seem most challenging, or most likely to require major research insights rather than just increasing the compute?
So again, with enough compute, I would expect, or I'd be willing to vouch, that you would get everything in human cognition. The question is, in some sense, which aspects of cognition are the most expensive to produce this way, or most likely to be prohibitively expensive, such that you can't do enough brute-force search to actually find them. Natural candidates are properties of human cognition that operate over very long timescales. Maybe evolution got to take a lot of cracks at developing different notions of curiosity until it found a notion of curiosity that was effective, or a notion of play that was effective for getting humans to do useful learning. If you have some proposed set of motivations for a human that you're rolling out, it's not clear we can evaluate it other than by actually having a bunch of human lifetimes occur. And so if there's a thing you're trying to optimize where, every time you propose something, in order to check it you have to run a whole bunch of human lifetimes, each check is going to be very expensive. And so if there are
cognitively complicated things of that kind, right? Maybe curiosity is simple, but if you have a thing like curiosity that's actually very complicated, or involves lots of moving parts, then it might be very, very hard to find something like that by brute-force search. Things that operate over very short timescales are much, much more likely to be found, because then you can try a whole bunch of things and get feedback about which works. But things that operate over very long timescales might be very hard.
So it sounds like you're saying that at some level of compute, you're pretty confident that current methods would produce human-level intelligence, and maybe much more. I think a lot of listeners would find that claim somewhat surprising, or at least being confident that it's true. What's the reason that you think that?
Yeah, so there are a bunch of things to be said on this topic. Maybe the first thing to say is that human intelligence was produced by this process: try some random genomes, take the genomes that produced the organisms with the highest fitness, then randomly vary those a little bit and see where you get. In order for that process to produce intelligence, you definitely need a bunch of things. At a minimum, you need to try a large number of possibilities. Okay, we're just discussing the claim that with enough compute this would work. So at a minimum you need to try a whole bunch of possibilities, and you also need an environment in which reproductive fitness is a sufficiently interesting objective. One reason you might be skeptical of this claim is that you might think the environment that humans, or all of life, evolved in is actually quite complex, and that even if we had large amounts of compute, we wouldn't actually be able to create an environment rich enough to produce intelligence in the same way. That's something I'm skeptical of, largely because humans do operate in this physical
environment, but almost all the actual complexity comes from other organisms. That's something you get for free if you're spending all this compute running evolution, because the agents you're producing get to interact with each other. Other than that, you have this physical environment, which is very rich. Quantum field theory is very computationally complicated if you want to actually simulate the behavior of materials. But it's not an environment that's optimized in ways that really pull out human intelligence. Human intelligence is not sensitive to the details of the way that materials break. If you take materials, apply stress, and just throw in some random complicated dynamics concerning how materials break, that seems about as good as the dynamics from actual chemistry, until you get to a point where humans are starting to build technology that depends on those properties. And by that point the game is already over. At the point when humans are building technologies that really exploit the fact that we live in a universe with this rich and consistent physics, you effectively already have human-level intelligence, and there's not much more evolution occurring. So on the environment side, I think most of the interesting complexity comes from the organisms in the environment, and there's not much evidence that the considerable computational complexity of the world is actually an important part of what gives you human intelligence. A second reason people might be skeptical: this 20-orders-of-magnitude estimate comes from thinking about the neurons in all the brains of all the organisms that have ever lived. You might think that maybe most of the interesting computation is being done early in the process of development, or in something about the way that genotypes translate
into phenotypes. If you think that, you might think that neuron counts are a big underestimate of the amount of compute involved, or similarly think that other things in organisms are more interesting than either development or neurons. I think my main response is that it really does look like we understand the way in which neurons do computing. A lot of the action is sending action potentials over long distances, and the brain spends a huge amount of energy on that. It looks like that's the way that organisms do interesting computing. It looks like they don't have some other mechanism that does a bunch of interesting computing, because otherwise they wouldn't be spending this huge amount of energy implementing the mechanism we understand. So it does look like brains work the way we think they work.
Yeah, so I guess there are some people who think that there's a lot of computation going on within individual neurons, and you're skeptical of that?
Yes. If you want to simulate a brain, you can imagine there being two kinds of difficulty. One is simulating the local dynamics of neurons, and the second is moving information over long distances, say as neurons fire action potentials. I think most likely, both in the brain and in computers, the movement of information is actually the main difficulty. The dynamics within a neuron might be very complicated, and it might take a lot of arithmetic operations to perform that simulation, but I think that's not hard compared to just shuffling the data around. And for shuffling data around, we have a much clearer sense of exactly how much happens, because we know there are these action potentials, and action potentials communicate information basically only in their timing. There's a little bit more to it than that, but we basically know how much information is actually getting moved.
It looks like ones and zeros.
Yeah, it looks like
ones and zeros that move, and the extra bits are in the timing. We roughly know what level of precision there is, and so there aren't that many bits per action potential.
I don't have a lot of understanding of the specifics of how machine learning works, but I would think one objection people might have is that even if you had lots of compute and tried to make the parameters of the machine learning algorithm more accurate, its structure might not match what the brain is doing. So you might just cap out at some level of capability, because there's no way for current methods, or the current way the data is being transformed, to actually produce general intelligence. Do you think there's any argument for that? Or is it just the case that the methods we have now are, at least at some level of abstraction, analogous to what the human brain is doing, and therefore with a sufficient amount of compute (maybe a very, very high amount) they should be able to copy everything that the human brain is doing?
Yeah. I would say that most of the time in machine learning, we fix an architecture and then optimize over computations that fit within that architecture. Obviously, when evolution optimizes for humans, it does this very broad search over possible architectures, looking at genomes that encode "here's how you put together a brain." We can also do a search over architectures, and so the natural question becomes: how effective are we at searching over architectures compared to evolution? I feel like this is mostly just a computational question. That is, the very highest-level architecture that evolution uses isn't that complicated, at the meta level, and so in the worst case you could just do a search at that same level of
abstraction.
I guess one point that we haven't discussed at all, but that some people would consider super relevant, is anthropic considerations concerning the evolution of humans. You might think that evolution only extremely rarely produces intelligent life, but that we happen to live on a planet where that process worked. What do you make of that?
I think it's kind of hard to make that fit with the evolutionary evidence. Carl Shulman and Nick Bostrom have a paper about this, and some other people have written about it periodically. I think the rough picture is that intelligence evolves too quickly: if there's some hard step in evolution, it has to be extremely early in evolutionary history. In particular, it has to happen considerably before vertebrates, and probably has to have happened by simple worms.
Why is that? Because those steps took longer than the later steps did?
Well, I think the easiest way to put it before vertebrates is just to say that cephalopods seem pretty smart, and the last common ancestor of an octopus and a human is a simple worm. I think that's probably the strongest evidence; that's from this paper by Carl.
Okay, because then we have another line that also produced substantial intelligence independently, and it would be incredibly suspicious if that happened twice on the same planet. And there we don't have the anthropic argument, because you could just as well live on a planet where it happened only once.
That's right. And you could think that maybe there's a hard step between octopi and humans, but then we're getting into the regime where, again, for any place you look, what is this hard step? Many things happen twice: birds and mammals independently seem to have become very, very intelligent. You could think that maybe in early vertebrates there was some lucky architectural choice made
in the design of vertebrate brains, which caused intelligence across the entire vertebrate line to then systematically increase quickly, and what was important was this lucky step. But at some point, you can try to run some argument for why you might get stuck before humans, and it seems pretty hard to do, or doesn't seem very convincing. It certainly doesn't seem like it would give you an argument for why you wouldn't reach at least octopus levels of intelligence. So if you're thinking that existing techniques are going to get stuck anywhere around their current level, then this kind of thing isn't going to be very relevant.
Yes. I guess it raises a definitional question of what counts as current techniques. How much do you change the architecture before you say, "Oh, this is no longer current machine learning methods"? So this is all about prosaic AI.
Yes. I think the thing that's really relevant from the perspective of alignment research is that you want to assume something about what you can do, and the thing you want to assume you can do is this: there is some model class, and you optimize over that model class given an objective. Maybe you care about whether the objective has to supply gradients, but maybe that doesn't even matter that much. So then, as an alignment researcher, you say, "Great, researchers have handed us a black box." The black box works as follows: it takes some inputs and produces outputs, you specify how good those outputs were, and then the black box adjusts over time to be better and better. As an alignment researcher, as long as something fits within that framework, you don't necessarily care about the details of what kind of architecture you're searching over, or whether you're doing architecture search. What form the objective takes you may care about, but most other details you don't really care about, because the alignment research isn't going to be sensitive to those details.
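The "black box" framing above (a model class, an objective that scores outputs, and a procedure that keeps adjusting the model to score better) can be made concrete with a toy Python sketch. Everything in it is an illustrative assumption, not something described in the episode or used at OpenAI: the two model families, the target behaviour y = 2x + 1, and the names `black_box_train` and `objective` are hypothetical, and random-search hill climbing stands in for gradient descent, reflecting the aside that the objective may not even need to supply gradients.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical "model classes": each maps (parameters, input) -> output.
Model = Callable[[Sequence[float], float], float]


def linear_model(params: Sequence[float], x: float) -> float:
    return params[0] * x + params[1]


def quadratic_model(params: Sequence[float], x: float) -> float:
    return params[0] * x * x + params[1] * x + params[2]


# Hypothetical training inputs and desired behaviour (y = 2x + 1).
TRAIN_INPUTS = [-2.0, -1.0, 0.0, 1.0, 2.0]


def objective(model: Model, params: Sequence[float]) -> float:
    """Score how good the outputs were: higher is better, 0 is perfect."""
    return -sum((model(params, x) - (2 * x + 1)) ** 2 for x in TRAIN_INPUTS)


def black_box_train(model: Model, n_params: int,
                    score: Callable[[Model, Sequence[float]], float],
                    steps: int = 3000, step_size: float = 0.1,
                    seed: int = 0) -> List[float]:
    """Repeatedly perturb the parameters and keep any change that scores higher.

    Random-search hill climbing: the score function is only ever called,
    so it never has to supply gradients, and the loop never inspects the
    model's internals.
    """
    rng = random.Random(seed)
    params = [0.0] * n_params
    best = score(model, params)
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, step_size) for p in params]
        candidate_score = score(model, candidate)
        if candidate_score > best:  # keep only improvements
            params, best = candidate, candidate_score
    return params


if __name__ == "__main__":
    # The same training loop is used unchanged for either architecture.
    for name, model, n in [("linear", linear_model, 2),
                           ("quadratic", quadratic_model, 3)]:
        trained = black_box_train(model, n, objective)
        print(name, round(objective(model, trained), 4))
```

The point the sketch is meant to illustrate is that `black_box_train` only calls the model and the score, so swapping the architecture leaves the training loop, and anything an alignment technique wraps around that loop, untouched; only the form of the objective is visible to it.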
So in some sense, you could easily end up with a system that existing ML researchers would say was quite a lot different from what we were doing in 2018, but which an alignment researcher would say is fine: the important thing from my perspective is that this still matches the kind of alignment techniques that we were developing. So we don't really care how different it looks; we just care about whether it basically changed the nature of the game from the perspective of alignment.
Can we look backwards in history and ask whether techniques that we developed five or ten years ago would work on today's architectures?
Yeah, we can look back, though hindsight is always complicated and hazardous. I think if you were to perform a similar exercise in, say, 1990 and look across techniques, certainly the kinds of things we're talking about now would exist and be part of your picture. They would not have been nearly as much of a focal point as they are today, because they hadn't yet worked nearly as well as they've worked now. So I guess we'd be talking about what fraction of your field of view these techniques would occupy. I think it's pretty safe to say that more than 10% of your field of view would have been taken up by the kind of thing we're discussing now, and the techniques developed with that 10% of possibilities in mind would still apply. Existing systems are very, very similar to the kinds of things people were imagining in the late 80s. There's a question of whether that number is 10% or a third. I think that's pretty unclear, and I don't have a detailed enough understanding of that history to really be able to comment intelligently; I'd want to defer to people who were doing research in the area at that time. I do think that if you had instead focused on different kinds of
techniques, if you'd been around in the further past and were, say, trying to do alignment for expert systems, I don't feel that bad about that. I guess some people look back at history and say, "Man, that would have been a real bummer if you'd been alive in the 60s and done all this alignment research that didn't apply to the kind of AI we're building now." My perspective is kind of: look, one, if it takes 50 years to build AI, it doesn't matter as much what the details of the alignment work in the 60s were. Two, there's actually a lot of overlap between those problems; many of the philosophical difficulties you run into in alignment are basically the same even between existing systems and expert systems. Three, I would actually be pretty happy with a world where, when people propose a technique, a bunch of alignment researchers invest a bunch of time in understanding alignment for, say, expert systems, and then 15 years later they move on to the second thing. That's not that bad a world. If you just solve a sequence of concrete problems like that, that actually sounds pretty good. It sounds like a way to get practice as a field, and it sounds reasonably likely to be useful; there are probably lots of commonalities between those problems. Even if they turn out to be totally wasted, it's a reasonable bet in expectation. That's sort of a cost you have to pay if you want to have done a bunch of research on the techniques that are actually relevant, unless you're very confident either that the current techniques are not the things that will go all the way, or that it's doomed. Both of those positions seem really, really hard to defend to me; I haven't heard very compelling arguments for either of them.
What are expert systems?
Systems based on having a giant set of reasoning rules and facts; they just use these rules to combine the facts. And that just
didn't work. There was a period where people were more optimistic about them. I don't know the history very well at all, but certainly it didn't realize the ambitions of the most ambitious people in that field, and it's certainly not the shape of the kinds of systems people are most excited about today.
Okay, so we mentioned that a group with a somewhat different view from this prosaic one is the Machine Intelligence Research Institute in Berkeley. If I understand correctly, you got into AI safety at least in part through that social group or intellectual group, but it seems like now you kind of represent a different node, or a different axis, among people working in AI safety. How would you describe their view, and how it differs from yours today?
I'd say the most important difference is that they believe this prosaic alignment project is very likely to be doomed. That is, they think that if the shape of sophisticated systems resembles the shape of existing ML systems, and in particular if you obtain AI by defining a model class, defining an objective, and doing gradient descent to find the model that scores well according to the objective, then we're just extremely doomed. So they think the right strategy is instead to step back from that assumption and ask whether we can understand other ways that people could build sophisticated AI. Part of that is: if you're doing gradient descent over this big model to find a model that performs well, you're going to end up with some particular model, some particular way of thinking that your neural net embodies. Instead of just specifying procedures that give rise to that way of thinking, you could try to understand that way of thinking directly and say, "Great, now that we understand this, we can both reason about
its alignment, and maybe we can also design it more efficiently, or more efficiently describe search procedures that will uncover it, once we know what it is that they're looking for." So I'd say that's the biggest difference. But the crux there, I think, is mostly: is it possible to design alignment techniques that make something like existing ML systems safe? My view is that most likely that's possible. Not radically more likely than not, but somewhat more likely than not. And as long as it looks possible, and we have attractive lines of research to pursue and a clear path forward, we should probably work on that by default. At some point, if we're like, "Wow, it's really hard to solve alignment if your systems look anything like existing ML," then you really want to understand as much as you can why that's hard, and then you want to step back and say, "Look, people in ML, it looks like the thing you're doing is actually sort of unfixably dangerous, and maybe it's time for us to think about weird solutions where we change the overall trajectory of the field based on this consideration about alignment." Maybe it's not reasonable to call that weird. From the outside, you might think, well, the goal of AI is to make things good for humans, so it's not crazy to change the direction of the field based on what is possibly going to be alignable. But it would seem strange to them today. People in ML would not be like, "Oh yeah, that makes a lot of sense, let's just swap what we're doing." So I guess my position is that I'm currently much, much more optimistic than MIRI people. I think that's the main disagreement: whether it'll be possible to solve alignment for prosaic AI systems. And I think that as long as we're optimistic in that way, we should work on that problem until we discover why it's hard or solve it.
Just to
make it more concrete for people, what are the kinds of specific questions that MIRI is researching that they think are useful?
I think at this point, in MIRI's public research, the stuff they publish on and talk about on the internet, one big research area is decision theory. That is, supposing you have some agent which is able to make predictions about things, or has views on the truth of various statements, how do you actually translate those views into decisions? This is tricky because you want to say you care about quantities like "what would happen if I were to do X," and it's not actually clear what "what would happen if I were to do X" means. Is it a causal counterfactual? Is it some other kind of statement? It's not clear that's the right kind of statement. This is the debate between causal decision theory and evidential decision theory. I think once you get really precise about it, once you'd like this to be an algorithm, the whole picture just gets a lot weirder, and there are a lot of distinctions that people don't normally consider in the philosophical community that you have to wade through if you want to get to a serious proposal: here's an algorithm for making decisions, given as input views on these empirical questions, or given as input a logical inductor or something like that. So that's one class of questions that they work on. Another big class of questions is stepping back, looking at the whole problem, and saying, from a conceptual perspective, supposing you grant this worldview about what's needed, then what? I don't know a good way to define this problem. It's kind of just "figure out how you would build an aligned AI," which is a good problem, the very high-level problem. Some people think about the very high-level problem, and I think it's one
of the more useful things to think about. There's some flavor of it that depends on what you think the important considerations are, or what you think the difficulties are, and they're working on a certain version of that problem. Other examples include that they're very interested in what good models of rational agency are. We have such models in some settings, like Cartesian settings, where you have an environment and an agent that communicate over some channel, sending bits to one another. It becomes much less clear what agency means once you have an agent that's physically instantiated in some environment. That is, what does it mean to say that the human is the consequentialist thing acting in the world, given that the human is actually just some of the degrees of freedom in the world, linked together in this complicated way to make a human? It's quite complicated to talk about what it actually means that there is this consequentialist in the world. That's a thing they're super interested in. Figuring out how to reason about systems that are logically very complex, including systems that contain yourself or contain other agents like you, and how to formalize such reasoning, is another big issue.
Does MIRI have a different view as well about the likelihood of current methods producing general intelligence?
There's probably some difference there, though it's a lot less stark than the other one. I think maybe a more important difference in that space is that I have a view that's more like: there's some easiest way for our current society, as it's currently built, to develop AI, and the more of a change you want to make from that default path, the more difficult it becomes. Whereas the MIRI perspective would be more like: the current way we build ML is reasonably likely to just be very
inefficient, and so it's reasonably likely that if you were to step back from that paradigm and try something very different, it would be comparably efficient or maybe more efficient. I guess I don't buy that claim, but I don't think it's as important as the definite-doom claim.
So what are the best arguments for MIRI's point of view that current methods can't be made safe?
I guess I'd say there are two classes of problems that they think might be unresolvable. One is: suppose I have some objective in mind, and suppose I even have the right objective, an objective that perfectly tracks how good a model is, all things considered, according to human values. Then I optimize really aggressively on that objective. The objective is still just a function of the behavior of this black box; I'm optimizing over the weights of some neural network. So now I have an objective that perfectly captures whether the behavior is good for humans, and I optimize really hard on that objective. One of MIRI's big concerns is that even if we assume that problem is resolved and you have such an objective, it's pretty likely you're going to find a model which only has these desirable properties on the actual distribution where it was trained. It's reasonably likely that the system you've trained is in fact going to be some consequentialist that wants something different from human flourishing, and just happens, on the training distribution, to do things that look good within that narrow range. An example of this phenomenon that people in the community think is pretty informative, though it's certainly not decisive on its own, is that humans were evolved to produce lots of human offspring, and yet humans are sophisticated consequentialists whose terminal goal is not just producing offspring. So even though the cognitive machinery that
humans use was very good for producing offspring over evolutionary history, it seems like it's not actually great now. It has already broken down to a considerable extent, and in the long run it looks like it will break down to a much, much greater extent. So if you were a designer of humans saying, "I'd like to find this objective that tracks how many offspring they have, and now I'm going to optimize biological life over a million generations to find the life which is best at producing offspring," you'd be really bummed by the results. So their expectation is that, in a similar way, we're going to be really bummed. We're going to optimize this neural net over a very large number of iterations to find something that appears to produce actions that are good by human lights, but we're going to find something whose relationship to human flourishing is similar to humans' relationship to reproduction: they sort of do it as a weird byproduct of a complicated mix of drives, rather than because that's the thing they actually want. So once it generalizes, it behaves very strangely.
Okay, that sounds kind of persuasive. What's the counter-argument?
I think there are a few things to say in response. One is that evolution does a very simple thing, where you sample environments according to some distribution and then see which agents perform well on those environments. When we train ML systems, we can be a little more mindful than that. In particular, we are free to sample from whatever distribution over environments we're able to construct. And so, as someone trying to solve AI alignment, you are free to look at the world and say, "Great, I have this concern about whether the system I'm training is going to be robust, or whether it might generalize in a catastrophic way in some new kind of context," and then I'm free to use that concern to inform the training process I
use. So I can say, "Great, I'm going to adjust my training process by, say, introducing an adversary, and having the adversary try to construct the inputs on which the system is going to behave badly." That's something that people do in ML: adversarial training. And if you do that, it's very different from the process evolution ran. Now imagine that there's someone roughly as smart as humans who's constructing these weird environments. If they look into humans and say, "Great, the humans seem to care about this art thing," then the adversary constructs an environment where humans have lots of opportunities to do art or whatever, and if they then don't have any kids, they get down-weighted. If there's some gap, some context under which humans don't maximize reproductive fitness, then the adversary can specifically construct those contexts and use them to select against it. Again, the reproductive fitness analogy makes this sound kind of evil, but you should replace reproductive fitness with things that are good. So that's one thing. I think the biggest thing, probably, is that as the designers of the system, we're free to do whatever we can think of to try to improve robustness. We can look forward, rather than just looking at the present generation. It's a challenging problem to do so; it's something a bunch of people work on, and it's not obvious they'll be able to succeed. I think the analogy maybe says there is a problem, that there's a possible problem, but it doesn't say that the problem is probably resistant to any attempt to solve it, because it's not like evolution made a serious attempt to solve the problem.
Yeah. If you can make the method corrigible, so that you can continue improving and changing it even as the AI is transforming the
world, it seems like that would partially solve the problem. Because what's happened here is that humans ended up with the motivations and desires that they have, and then, in basically a single generation, or a handful of generations in evolutionary time, we're going about changing everything about the environment. The environment is changing much faster than we are, such that our drives no longer match what would actually be required to make us reproduce at the maximal rate. Whereas if you were changing humans as we went, as our behavior ceased to be adaptive from that point of view, then perhaps you could keep us in line, so that it stayed fairly close to the maximal reproductive rate. Does that make sense?
Yeah, I think that's an important part of the picture for why we have hope. If you're like, "We're going to evolve a thing that just wants human flourishing, we're going to do gradient descent until we find a thing that really wants that, and we're going to let it rip in the universe," that doesn't sound great. But suppose instead you're like, "We're going to try to find a thing which helps humans in their efforts to continue to create systems that help humans continue to create systems that help humans flourish." I guess you could imagine, in the analogy, that instead of trying to evolve a creature which just really cares about human flourishing, you're trying to evolve a creature that's really helpful and doesn't kill everyone and so on. That's an easier, or more imaginable, set of properties to have across a very broad range of environments than matching some exact notion of what constitutes flourishing by human lights. I think one reason that people at MIRI, or from a similar school of thought, are pessimistic about that is
they have this mental image of humans participating in that continuing training process, training more and more sophisticated AIs. If you imagine a human intervening and saying, here's how I'm going to adjust this training process, here's how I'm going to shape the course of this process, it sounds kind of hopeless, because humans are so much slower, and in many respects presumably so much less informed and less intelligent, that they might just be adding noise. Yeah, and human involvement would be very expensive, and mostly they wouldn't push it in a good direction, just some random direction. I think the main response there is that you should imagine humans performing this process early on. Early in this process, you should imagine humans being the ones adjusting objectives or adjusting the behavior of the system; later on, you should imagine it mostly being carried out by the current generation of AI systems. The reason that humans can keep up as the process goes faster and faster is, hopefully, that we're maintaining this property that there are always a whole bunch of AI systems trying to help us get what we want. Will it continue to bottom out, in some sense, in what humans say about how the upper level is going? If there are multiple levels, the most advanced AI, then the less advanced AI, which I guess is kind of what we talked about earlier, then you've got humans at the bottom. Do they at some point just disappear from the equation? Yes, well, it's always going to be anchored in what a human would have said. Yeah, in some sense that's the only source of ground truth in the system. But humans might not actually be there: there might be some year beyond which humans never participate. At that point, the reason that would happen would be because there is some system that can take over that role. Suppose, say, that in 2042 humans stop ever providing any input to AI systems again.
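A minimal sketch of the iterated-amplification idea behind this answer: each generation is built only from the previous, already-trusted generation, and every answer ultimately bottoms out in questions small enough for a human to answer. The task (summing a list), the `HumanOverseer` class, its `max_size` limit, and the doubling schedule are all hypothetical choices for illustration; the distillation step, where a fast learned model is trained to imitate the amplified agent, is omitted.

```python
from typing import Callable, Sequence, Tuple

Agent = Callable[[Sequence[int]], int]


class HumanOverseer:
    """Hypothetical ground-truth source: only trusted to answer very small questions."""

    def __init__(self, max_size: int = 2) -> None:
        self.max_size = max_size
        self.calls = 0

    def __call__(self, xs: Sequence[int]) -> int:
        if len(xs) > self.max_size:
            raise ValueError("question too large for the human overseer")
        self.calls += 1
        return sum(xs)


def amplify(agent: Agent, capacity: int) -> Tuple[Agent, int]:
    """Return an agent that handles questions twice as large, by splitting each
    question in half and asking the previous generation about each half."""

    def amplified(xs: Sequence[int]) -> int:
        if len(xs) <= capacity:
            return agent(xs)
        if len(xs) > 2 * capacity:
            raise ValueError("question too large for this generation")
        mid = len(xs) // 2
        return agent(xs[:mid]) + agent(xs[mid:])

    return amplified, 2 * capacity


def train_generations(human: HumanOverseer, generations: int) -> Tuple[Agent, int]:
    # Generation 0 is the human; each later generation is built only from the one before.
    agent: Agent = human
    capacity = human.max_size
    for _ in range(generations):
        agent, capacity = amplify(agent, capacity)
    return agent, capacity


human = HumanOverseer()
agent, capacity = train_generations(human, generations=3)
print(capacity, agent(list(range(16))), human.calls)  # 16 120 8
```

After three generations the system answers questions eight times larger than the human can, yet every answer is assembled from eight human-sized sub-answers. A real handoff would replace those base calls with a trusted learned model, which is the robustness problem raised in the next part of the answer.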
The reason that would be possible is that in 2042 there is some AI system which we already trust is robustly doing well enough by human lights, and it can do the job faster and cheaper. It's a little bit tricky to ever have that handoff occur, because the system in 2042 that you're trusting to hand things off to has never been trained on things happening in 2043. So it's a little bit complicated. It's not that you're going to keep running that same system; it's that in 2042 that system is going to be sufficiently robust that it can help you train the system in 2043. Yeah. If you could visit fifty years in the future and see everything that was happening there, how likely do you think it would be that you would say, my view of this was broadly correct, versus MIRI's view was more correct than mine, with hindsight? I'm trying to measure how confident you are about your general perspective. Yes. I certainly think there are a lot of cases where both views were very wrong in important ways, and you could easily imagine both sides saying, yeah, but my view is right in the important way. That certainly seems reasonably likely. In terms of thinking in retrospect that my view is unambiguously right relative to MIRI's view, I don't know, maybe 50 to 70 percent, which is pretty high; say 60 percent that in retrospect we'll be like, oh yeah, this was super clear. And then on the other side, I would put a relatively small probability on it being super clear in the other direction, maybe 20 percent or 10 percent or something, that in retrospect I'm like, geez, I was really wrong; clearly my presence in this debate was just making things worse. And then there's kind of a middle ground where both of
them had important things to say. Interesting. Okay, so you're reasonably confident, but given those probabilities, you would still support MIRI doing substantial research into their line of inquiry? Yeah, I'm excited by MIRI doing the stuff they're doing. I would prefer that MIRI people do things that were better on my perspective, which I expect is most likely to happen if they came to agree more with this perspective. But at some point, let's say your line of research had four times as many resources or four times as many people; you might say, well, having one more person on this other thing could be more useful, even given your views, right? Yeah, although I don't think the situation is like there's research line A and research line B, and the chief disagreement is about which line of research to pursue. It's more like, if I was doing something very similar to what MIRI is doing, or superficially quite similar, I would do it in a somewhat different way. To the extent I was working on philosophical problems that clarify our understanding of cognition or agency, I would not be working on the same set of problems that MIRI people are working on, and I think those differences probably matter more than the high-level question of what the research looks like. There's lots of stuff in the general space of research MIRI is doing where I'd be like, yep, that's a good thing to do, and then we're in this regime where it depends on how many people are doing one thing versus the other thing. If your view is correct, is there going to be much incidental value from the research that MIRI is doing, or is it kind of just by the by at that point? So one way in which research of the kind MIRI is doing is relevant: when I talk about amplification or debate, they
each have a key conceptual uncertainty. In the case of debate: is it the case that the honest strategy, telling the truth and saying useful things, is actually a winning strategy? In the case of amplification: is there some way to assemble a large team of aligned agents such that the resulting system is smarter than the original agents and remains aligned? Those conceptual difficulties seem not at all unrelated to the kind of conceptual question MIRI asks: how would we build an aligned AI using infinite amounts of computing power, without thinking at all about contemporary ML? That's a very similar kind of question. You're asking: what are the correct normative standards of reasoning that you should use to evaluate competing claims? When you compose these agents, what kind of decomposition of cognitive work is actually alignment-preserving, or can be expected to produce correct results? So a natural way in which the kind of research MIRI is doing could add value is by shedding light on those questions. In expectation, I guess they're at least several times less effective at answering those questions than if they were pointed at them more directly. I don't know if it's like five times less effective than if they were pointed at them directly; I think it's probably a smaller multiple than that. What would you say to people who are listening who just feel kind of agnostic on whether the prosaic AI approach or MIRI's is the best? What would I say in terms of what they ought to do? Yeah, or maybe what they ought to think, all things considered, if they don't feel qualified to judge this debate. Sure. In terms of what to do, I suspect comparative advantage considerations will generally loom large, and if one's really agnostic, those will likely end up dominating: comparative advantage, plus, in the short term, what will be most
informative and allow the most learning, and what will build the most fungible or flexible capital. In terms of what to think all things considered, I don't know, that seems pretty complicated. It can depend a lot on what kind of expertise they have, and in general on how you decide whose views to take seriously when you're looking at a situation with conflicting people who have thought a lot about it. To be clear, on the spectrum of views amongst all people, my view is radically closer to MIRI's than almost anyone else's in the machine learning community. In most respects, anyway; there are other respects in which the machine learning community is closer to MIRI than I am. So the actual menu of available views is unfortunately even broader than just my view and MIRI's. Yes, indeed, it's unfortunate. So there's this third option? Yeah, in fact it's quite a lot broader. I think being agnostic is not a crazy response. There's an easy position where all these perspectives differ substantially in emphasis, but one could basically put significant probability on the most confident claims from every perspective. Certainly the convex combination of them will be more accurate than a particular perspective, and then, in order to do significantly better than that, you're going to have to start making more serious claims about whom you're willing to trust. Yeah. What would you say is the third most plausible broad view? So I think one reasonably typical view in machine learning, to which I'm sympathetic, is that all of this will be mostly okay. As AI progresses, we'll get a bunch of empirical experience messing around with ML systems; sometimes they will do bad things and we'll correct them, and that problem will not require heroic acts of understanding safety specifically or alignment specifically.
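A small sketch of the "convex combination" remark above, using the Brier score as one example of a convex (proper) scoring rule. The three forecasts are hypothetical numbers, not anyone's stated views. Because the Brier score is convex in the forecast, the mixture's score is never worse than the weighted average of the individual scores, whichever way the question resolves; that is a guarantee against picking a perspective at random with those weights, not against the one that turns out best in hindsight.

```python
from typing import Sequence


def brier(p: float, outcome: int) -> float:
    """Brier score (squared error) of forecast p for a binary outcome; lower is better."""
    return (p - outcome) ** 2


def mixture(forecasts: Sequence[float], weights: Sequence[float]) -> float:
    """Convex combination of forecasts; weights must be non-negative and sum to 1."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1) < 1e-9
    return sum(w * f for w, f in zip(weights, forecasts))


# Hypothetical probabilities that three perspectives assign to the same yes/no question.
forecasts = [0.9, 0.3, 0.05]
weights = [1 / 3, 1 / 3, 1 / 3]
mix = mixture(forecasts, weights)

for outcome in (0, 1):
    avg_individual = sum(w * brier(f, outcome) for w, f in zip(weights, forecasts))
    print(outcome, round(brier(mix, outcome), 4), round(avg_individual, 4))
# 0 0.1736 0.3008
# 1 0.3403 0.4675
```

If the answer turns out to be yes, the first perspective alone would have scored 0.01, far better than the mixture; doing better than the mixture requires the "more serious claims about whom you're willing to trust" mentioned above.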
At least not beyond what might happen anyway. It's a little bit hard; you could separate that out into a claim about what will happen anyway and a claim about what is required. I guess the views we're talking about, from me and MIRI, were more about what is required, and we have separate disagreements about what will happen. I think there's a different ML position on what is likely to be required, which says more like: we have no idea what's likely to be required; it's reasonably likely to be easy; any particular thing we think is reasonably likely to be wrong. I could try to flesh out the view more, but roughly it's this: we don't know what's going to happen, and by default it's expected to be easy. I think there's a reasonable chance that in retrospect that looks like a fine view. I don't see how to end up with high confidence in that view, and if you put, say, a 50% chance on that view, it's not going to have that huge an effect on your expected value of working on safety. It maybe halves it? Yes, at worst. And if you give that view a significant probability, that might matter a lot if you had a "we're definitely doomed" view. So I think on the MIRI view, giving significant credence to the normal machine learning perspective would significantly change what they would do, because they currently have this view where we're at roughly zero. If you've seen the post on the logistic success curve that Eliezer wrote, the idea is that if your probability of success is close to zero, then most interventions that look common-sensically useful aren't actually going to help you very much, because they're just going to move you from 0.01 percent to 0.02 percent. So this would be a view on which you need many things all at once.
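The 0.01-percent-to-0.02-percent arithmetic can be made concrete with a logistic curve. This is a sketch under one simplifying assumption, not something taken from the post itself: each intervention adds a fixed amount to the log-odds of success (here, it doubles the odds). Under that assumption, the same intervention is worth very different amounts depending on where you start.

```python
import math


def logit(p: float) -> float:
    """Log-odds of probability p."""
    return math.log(p / (1 - p))


def logistic(x: float) -> float:
    """Inverse of logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


def after_intervention(p: float, log_odds_shift: float) -> float:
    """Hypothetical model: an intervention adds a fixed amount to the log-odds of success."""
    return logistic(logit(p) + log_odds_shift)


shift = math.log(2)  # an intervention that doubles the odds of success
for p in (0.0001, 0.1, 0.5, 0.9):
    q = after_intervention(p, shift)
    print(f"{p:.4%} -> {q:.4%}  (gain {(q - p) * 100:.3f} percentage points)")
```

Doubling the odds moves 0.01% to about 0.02%, a gain of about a hundredth of a percentage point, but moves 50% to about 66.7%. That is why, on a near-zero view, only a combination of many interventions at once looks worth much, while on a view with a substantial chance of success a single intervention can matter.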
You need all of them to have any significant chance of success, so getting just one out of the hundred things you need doesn't move you much? That's right. Just making organizations a little bit more sane, or fixing one random problem here and another problem there, isn't going to help much. So if you have that kind of view, it's kind of important that you're putting really, really low probability on this easy-case view. It isn't the only perspective in ML, but it's one conventional ML perspective. On my view, it doesn't matter that much whether you give it 30 percent or 50 percent or 20 percent probability, but I think that probability is not small enough that you should just discount that case: interventions that are good if the problem is not that hard are likely to be useful. I would have thought it makes little difference to your strategy, because in that case things would be okay and you don't really have to do anything. So even if you think it's 50/50 whether any of this is necessary, you can largely ignore that case. Yeah, as I'm saying, it doesn't make a huge difference. A way in which it matters: there are some interventions that are good in worlds where things are hard, and if it's 50/50, those interventions are still good; maybe they're half as good as they otherwise would have been. And there are some interventions that are good in worlds where things are easy; that is, you might say, well, if things are easy, we could still screw up in various other ways and make the world bad. So if you have those probabilities, I would say that's also a viable intervention, because the probability that things are easy is not so low that it's getting driven down to zero. So then it's more normal world improvement, or making it more likely that we encode good values? Like, if we've sort of solved the alignment problem,
then it becomes more a question of what values we will in fact program into an AI, and trying to make sure those are good ones. Yeah, I mean, there are a lot of things that could come up in that world. For example, if AI has a very uneven profile of abilities, you could imagine having AI systems that are very good at building better explosives or designing more clever biological weapons, but that aren't that good at helping us accelerate the process of reaching agreements that control the use of destructive weapons, or at better steering the future. So another problem, independent of alignment, is this uneven-abilities-of-AI problem; that's one example. Or you might just be concerned that as the world becomes more sophisticated, there will be more opportunity for us to blow ourselves up. Or we might be concerned that we will solve the alignment problem and build AI, but then someday that AI will build a future AI and it will fail at the alignment problem. So there are lots of problems you could care about. I suppose AI could also be destabilizing to international relations or politics, or be used for bad purposes: even if we can give it good instructions, people might give it instructions to cause harm. Yes, and then there's a question of how much you care about destabilization. I think most people would say they care some: even if you have a very far-future-focused perspective, there's some way in which that kind of destabilization can lead to irreversible damage. So yeah, there's a bunch of random stuff that can go wrong with AI, and you might become more interested in attending to that, or in asking how we solve those problems with a mediocre understanding of alignment, if a mediocre understanding of alignment doesn't automatically doom you. Yeah. Is there anything else you want to say on MIRI
before we move on? Obviously at some point I'll get someone on from there to defend their view and explain what research they think is most valuable, hopefully sometime in the next couple of months. Yeah, I guess one thing: as I mentioned earlier, there are these two kinds of concerns, or two kinds of arguments, that we're super doomed on prosaic AI, if AI looks like today's ML systems. I mentioned one of them: even if you write down the right objective, plausibly the AI you produce will have some other real objective; it will be a consequentialist who just incidentally is pursuing that objective on the training distribution. There's a second concern, that actually constructing a good objective is incredibly difficult. In the context of the kinds of proposals I've been discussing, like iterated amplification, they'd be saying, well, all the magic occurs in the step where we're aggregating a bunch of people to make better decisions than those people would have made alone. And in some sense, any way that you try to solve prosaic AI alignment is going to have some step like that, where we are implicitly encoding some answer to the alignment problem in the limit of infinite computation. They might think that alignment in that limit is still incredibly difficult, or has all the core difficulties in it. So this might not say that we're doomed under a particular approach, but more directly they would just say, great, we need to solve that problem first anyway, so there's no reason to work on the distinctive parts of prosaic AI alignment rather than attacking that conceptual problem, and then either learning that we're doomed, or having a solution which would give you a more direct angle of attack. So you're on the board of a newish project called Ought. What is that all about? So the basic mandate is understanding how
we can use machine learning to help humans make better decisions. The basic motivation is that if machine learning makes the world a lot more complicated and is able to transform the world, we also want to ensure that machine learning is able to help humans understand those impacts and steer the world in a good direction. That's in some sense what the alignment problem is about: you want to avoid a situation where there's a mismatch between how well AI can help develop new technologies and how well AI can help you actually manage the more complicated world that it's creating. I think the main project, certainly the project at Ought I'm most interested in, is what they call factored cognition, which is basically understanding how you can take complex tasks and break them down into pieces, where each piece is simpler than the whole, so it doesn't depend on the whole context, and then compose those contributions back to solve the original task. You could think of that in the context of taking a hard problem and breaking it down into pieces that individual humans can work on: say you have a hundred humans, and you don't want any one of them to have to understand the entire task, so you break off some little piece that each person can solve. Or you could think of it in the context of machine learning systems, and in some sense the human version is the most interesting because it is a warm-up, a way of studying the ML version in advance. In the ML version, instead of a hundred people, you'd have some people and a bunch of ML systems, and maybe an ML system has a more limited ability to respond to complex context: a human has a lot of context in mind when they take on a problem, and an ML system has a limited ability to respond to that context. But fundamentally, I think there's a more interesting reason to care about breaking tasks down into small pieces.
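A minimal sketch of the factored-cognition pattern described above: no single worker ever sees the whole input; each sees a bounded piece, and the partial answers are composed back into an answer to the original task. The task (counting a word in a document), the `budget` parameter standing in for limited context, and the worker functions are hypothetical illustrations, not Ought's actual experimental setup.

```python
from typing import List


def leaf_worker(tokens: List[str], word: str, budget: int) -> int:
    """One worker: sees at most `budget` tokens and counts occurrences of `word`."""
    assert len(tokens) <= budget, "worker was shown more context than allowed"
    return sum(1 for t in tokens if t == word)


def combine_worker(counts: List[int], budget: int) -> int:
    """One worker: sees at most `budget` partial answers and adds them up."""
    assert len(counts) <= budget, "worker was shown more context than allowed"
    return sum(counts)


def factored_count(tokens: List[str], word: str, budget: int) -> int:
    """Count `word` in a document that no single worker is allowed to read in full."""
    if budget < 2:
        raise ValueError("budget must be at least 2 so partial answers can be combined")
    # Decompose: each leaf worker gets one chunk and none of the surrounding context.
    partials = [leaf_worker(tokens[i:i + budget], word, budget)
                for i in range(0, len(tokens), budget)]
    # Compose: combine partial answers in groups small enough for one worker.
    while len(partials) > 1:
        partials = [combine_worker(partials[i:i + budget], budget)
                    for i in range(0, len(partials), budget)]
    return partials[0] if partials else 0


doc = "the cat sat on the mat while the dog slept".split()
print(factored_count(doc, "the", budget=3))  # 3
```

Word counting decomposes trivially because each chunk's answer is independent of the others; the open research question is which realistic tasks admit decompositions like this without any worker needing the full context.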
Once you make the tasks simpler, once an ML system is solving just some piece, it becomes easier to evaluate whether its behavior is good. This is very related to the way that iterated amplification hopes to solve alignment: by saying we can inductively train more and more complex agents by composing weaker agents to make stronger agents. So this factored cognition project is one possible approach for composing a bunch of weaker agents to make a stronger agent; in that sense it's addressing one of the main ingredients you would need for iterated amplification to work. I think right now Ought is kind of the main project that's aiming at acquiring evidence about how well that kind of composition works, again in the context of just doing it with humans, since humans are something we can study today: we can just recruit a whole bunch of humans. There's a ton of work to do in actually starting to resolve that uncertainty, and a lot we have to learn before we'll be able to tell whether this works or not, but that's one of the main things Ought is doing right now, and the reason I'm most excited about it. Is it a business, or is it a nonprofit? Nonprofit. Okay. Is Ought hiring at the moment, and what kind of people are you looking for? Yeah, so I think there are some roles that will hopefully be filled by the time this podcast comes out. Some things that are likely to be continuing hires are researchers who are interested in understanding this question: thinking about how you compose small contributions into solutions to harder tasks. There are several different disciplines that potentially bear on that, but roughly people interested in computer science; the approach they're taking means
that people from programming languages research are also a reasonable fit. And some of this doesn't fit clearly into any academic discipline; you just think about the problem of how you put together a bunch of people: how do you set up these experiments, and how do you help humans function as part of such a machine? So researchers who are interested in those problems are one genre. Another is engineers interested in helping actually build systems that will be used to test possible proposals, or that will instantiate our best guesses about how to solve those problems. In contrast, OpenAI is currently hiring researchers and engineers in ML, so engineers there would be building ML systems, testing ML systems, debugging and improving ML systems, and so on. Ought is similarly hiring both researchers and engineers and people in between, but the focus is less on ML and more on, again, building systems that will allow humans, and humans plus simple automation, to collaborate to solve hard problems. So it involves less of a distinctive ML background; it's potentially a good fit for people who have a software engineering background and find the problems interesting, whether they have some directly relevant background or just a broad background in software engineering. Okay, well, I'll stick up a link to the website, with more information on specifically what it's doing and what vacancies are available, whenever we manage to edit this and get it out. Cool. Okay, let's talk about what listeners who are interested in working on this problem can actually do. What advice do you have for them? We've had a number of episodes on AI safety issues which have covered these topics before: with Dario Amodei, your colleague, as I mentioned, Jan Leike at
DeepMind, as well as Miles Brundage and Allan Dafoe at FHI, working on more policy and strategy issues. Do you have a sense of where your advice might deviate from those four people, or from other people in general on this topic? So I think there's a bunch of categories of work that need to be done, or that we'd like to be done. I'd probably agree with all the people you just listed; each of them presumably would have advocated for some kind of work. I guess Dario and Jan probably advocated for machine learning work that really tries to connect ideas about safety to our actual implementations, building up the engineering expertise to make these things work, and acquiring empirical evidence about what works and what doesn't. I think that project is extremely important, and I'm really excited about EAs training up in ML and being prepared to help contribute to that project: figuring out whether ML is a good fit for them, and if so, contributing. I don't want to talk much about that, because I assume it's been covered on previous podcasts. I also agree with Miles and Allan that there's a bunch of policy work and strategic work that seems incredibly important, and I also don't want to talk more about that. I think there are some categories of work that I consider important that I wouldn't expect those people to mention. For people with a background in computer science but not machine learning, or who don't want to work on machine learning because they've decided it's not the best thing for them or don't enjoy it, I think there's a bunch of other computer science work that's relevant to understanding the mechanics of proposals like debate or amplification. An example would be that right now one of Ought's projects is on factored cognition: in general, how can you take a big task and decompose it into pieces which don't depend on the entire
context, and then put those pieces together in a way that preserves the semantics of the individual agents, or the alignment of the individual workers? That's a problem which is extra important in the context of machine learning, or of iterated amplification, but one can study it almost entirely independently of machine learning. That is, one can just say: let's understand the dynamics of this composition; let's understand what happens when we apply simple automation to that process; let's understand which tasks we can and can't decompose; let's understand what kinds of interfaces, what kinds of collaboration amongst agents, actually work effectively. So that's an example of a class of questions which are well suited to a computer science perspective but aren't necessarily machine learning questions, and which I'd be really excited to see work on. There are similar questions in the debate space, like understanding how we should structure such debates and whether they lead to truth. I think one could also study those questions not from a computer science perspective at all. Philosophers differ a lot in their taste, but if you're a philosopher interested in this area, then "under what conditions does debate lead to truth?" is not really a question about computers in any sense. It's the kind of question that falls under computer scientists' sensibilities, but a technical, though not necessarily quantitative, approach to that question is accessible to lots of people who want to try and help with AI safety. Similarly for amplification. So I think in both of those areas there are questions that could be studied from a very computer-science perspective, involving software engineering and running experiments.
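One way to see what "under what conditions does debate lead to truth?" can mean from the computer-science side is a toy recursive debate. This sketch rests on strong, hypothetical assumptions: the disputed question is the sum of a list, the judge can only check simple arithmetic and inspect a single element, and the challenger is honest and knows the true answer. It illustrates the recursive idea only; it is not the protocol from any particular debate proposal.

```python
from typing import Callable, List, Optional, Tuple

# A claimant's strategy for splitting a claim about sum(xs[lo:hi]) into two sub-claims.
Splitter = Callable[[List[int], int, int, int], Tuple[int, int]]


def honest_split(xs: List[int], lo: int, hi: int, claim: int) -> Tuple[int, int]:
    mid = (lo + hi) // 2
    return sum(xs[lo:mid]), sum(xs[mid:hi])


def lying_split(xs: List[int], lo: int, hi: int, claim: int) -> Tuple[int, int]:
    # Hypothetical liar: truthful about the left half, pushes all the error into the right half.
    mid = (lo + hi) // 2
    left = sum(xs[lo:mid])
    return left, claim - left


def run_debate(xs: List[int], claim: int, split: Splitter) -> Tuple[bool, Optional[int]]:
    """Toy recursive debate over the claim 'sum(xs) == claim'.

    The judge never adds up the list: it only checks that each split is arithmetically
    consistent and, at the end, looks at a single element. Returns
    (claim upheld?, index the judge inspected)."""
    if not xs:
        return claim == 0, None
    lo, hi = 0, len(xs)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left, right = split(xs, lo, hi, claim)
        if left + right != claim:
            return False, None  # inconsistent split: the judge can check this directly
        # The honest challenger knows the truth and attacks a sub-claim that is false.
        if left != sum(xs[lo:mid]):
            hi, claim = mid, left
        else:
            lo, claim = mid, right
    return xs[lo] == claim, lo


xs = [3, 1, 4, 1, 5, 9, 2, 6]
print(run_debate(xs, 31, honest_split))  # (True, 7)
print(run_debate(xs, 32, lying_split))   # (False, 7)
```

In this setting a false total always loses: any consistent split of a false claim contains a false half, and the challenger can follow it down to one element the judge checks directly. Whether anything similar holds for questions that don't decompose this neatly, or when the judge is a fallible human, is exactly the kind of open question raised here.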
They can also be studied from a more philosophical perspective, just thinking about the questions, about what we really want and how alignment works. And they can be studied from more of a psychology perspective: actually running relatively large-scale experiments involving humans. I don't know if the time is right for that, but there are definitely experiments in this space that do seem valuable, and it seems like at some point in the future there are going to be more of them. Sorry, what do you mean by that? So if you ask how this kind of decomposition works, or how these kinds of debates work, the decomposition is ultimately guided by humans, right? I originally described this process as involving a human and a few AI assistants. Ultimately you want to replace that human with an AI that's imitating what a human would do, but nevertheless, the way we currently anticipate training that system involves a ton of interaction with humans: the machine is really just imitating, or maximizing the approval of, some human who's running that process. So in addition to caring about how the machines work, you care a ton about how that process works with actual humans, and about how you can cheaply collect enough data from humans that you can actually integrate this into the training process of powerful AI systems. I don't think that bears on many of the traditional questions in psychology, and maybe psychology is a bad word for it, but it involves studying humans: questions about how particular humans behave and about how to effectively or cheaply get data from humans, which are not really questions machine learning people have traditionally had to deal with. We also deal with humans, but really it's
a much larger problem, and machine learning people are not that good at dealing with the interaction with humans at the moment. So yeah, I think that's another family of questions. Anyway, the ones I'm most excited about are probably more in the philosophy and computer science vein, but I think there are also people who wouldn't do great work in ML but would be great at working on those questions. Also, stepping back further and setting aside amplification and debate, I think there are still a lot of very big-picture questions about how you make AI safe. That is, you could focus on some particular puzzle, but you could also just consider the process of generating additional proposals, understanding the landscape of possibilities, and understanding the nature of the problem. I don't know if you'll have anyone from MIRI on, but I'm sure they would advocate for this kind of work, and I consider that pretty valuable. At the moment I'm probably more excited about pushing on our current most promising proposals, since I've spent a bunch of time thinking about alternatives and they don't seem as great to me, but I also think there's a lot of value in clarifying our understanding of the problem, or trying to generate totally different proposals, or trying to understand what the possibilities are. Great. Yeah, we're planning to get someone on from MIRI in a couple of months' time, perhaps when it fits better with their plans, and they're hoping to hire, so there we go: we get some synergies between having the podcast and them actually having some jobs available. To make things a little bit more concrete, what are OpenAI's hiring opportunities at the moment? In particular, I heard that you're not just hiring ML researchers but also looking for engineers, so I was interested to learn how they help with your work, and how valuable those roles are compared to
the kind of work that you're doing. I think there's a spectrum between research and engineering, but people at OpenAI don't sit at either extreme of that spectrum. Most people are doing some combination of thinking about more conceptual issues in ML, running experiments, writing code that implements ideas, then messing with that code, thinking about how it works, and debugging. There are a lot of steps in this pipeline that are not that cleanly separated, so I think there's value at the current margin from all the points on the spectrum. Right now, even people on the safety team who are nominally very far towards the research end are still spending a pretty large fraction of their time doing things that are relatively far towards engineering: a lot of time setting up and running experiments and getting things working. Again, the spectrum between engineering and research is not that clean; ML is not really in a state where it's that clean. So I think right now there's a lot of room for people who are more at the engineering side, and by that I mean people who don't have a background doing research in ML but do have a background doing engineering, and who are interested in learning about ML and willing to put in some time, maybe on the order of months, getting more experience thinking about ML and doing engineering related to ML. I think there's a lot of room for that. So I mentioned these three problems. The first problem was actually getting the engineering experience to make, say, amplification or debate work at scale. I think that involves a huge amount of getting things to work, by the construction of the task. Similarly, in the third
category, trying to push safety out far enough that ML can actually be interacting in an interesting way with human cognition, also involves pushing things to a relatively large scale and doing some work that's more similar to conventional machine learning work rather than safety in particular. I think both of those problems are pretty important, and neither is that heavily weighted toward very conceptual machine learning work. My current take is that the second category of work, figuring out from a conceptual perspective whether this is a good scheme, seems like the most important stuff to me, but it also seems very complementary with the other two categories. Our current philosophy, which I'm pretty happy with, is that we actually want to be building the systems and starting to run experiments on them in parallel with thinking about what the biggest conceptual issues are. That's for a combination of reasons. First, the experiments can also kill a scheme: even if the conceptual stuff works, if the experiments don't, that's another reason the thing can be a non-starter. Second, running a bunch of experiments can actually give you a lot of evidence and help you understand the scheme much better. And obviously, independent of that complementarity, actually being able to implement these ideas is important; there's complementarity between knowing whether X works and actually having the expertise to implement X. The good case, the case that we're aiming at, is one where we have both developed a conceptual understanding of how you can build aligned AI and actually developed teams and groups that understand it and are trained to put it into practice in cases where it matters.
We'd like to aim toward a world with a bunch of teams that are basically able to apply cutting-edge ML to make systems that are aligned rather than unaligned. That harks back to the very beginning of the discussion, where we talked about the two functions of safety teams. I think the second function, actually making the AI aligned, is also an important function. Obviously it only works if you've done the conceptual work, but realistically the main way the conceptual work is going to be valuable is if there are teams that are able to put it into practice, and that problem is to a significant extent an engineering problem.

Just quickly, do you know what vacancies OpenAI has at the moment?

I mostly think about the safety team. On the safety team, we're very interested in hiring researchers who have a background in ML research, who have done that kind of work in the past, or who have done exceptional work in nearby fields and are interested in moving into ML. We're also pretty interested in hiring ML engineers, that is, people who have done engineering work and may be interested in learning ML, or have put in some amount of time on it. Ideally these are people who are either exceptional at doing engineering related to ML, or exceptional at engineering and have demonstrated that they're able to get up to speed in ML and are now able to do high-quality work. And again, in terms of what those roles involve, there's not a clean separation between them; it's basically just a spectrum. There are several different skills that are useful, and we're really looking for all of them: the ability to build things, the parts of engineering that are distinctive to ML, the ability to reason about safety, and the ability to reason about ML at a conceptual level. So the safety team is currently looking for
the entire spectrum of stuff we do, and I think that's probably the case in several other teams within the organization. It's sort of large enough now that, given any particular skill set on that spectrum, there's probably a place for it. The organization overall is not that large, on the scale of sixty full-time people, so there are still a lot of roles that don't really exist in the way they would at a very large company. But there's a lot of engineering to be done, a lot of conceptual work to be done, and a lot of the whole space in between.

How does OpenAI compare to DeepMind and the other top places that people should have at the forefront of their minds?

Do you mean in terms of my assessment of impact, or in terms of the experience day-to-day?

In terms of impact, mostly.

I don't think I have a really strong view on this question. I think it depends in significant part on things like where you want to be and which particular people you're most excited about working with; I guess those are going to be the two biggest inputs. I think both teams are doing reasonable work that accelerates safety, and both teams are getting experience implementing things and understanding how safety can be integrated into an AI project. I'm optimistic that over the long run there will be some amount of consolidation of safety work at wherever happens to be the place designing the AI systems for which it's most relevant.

A question that quite a few listeners wrote in with for you was how much people who are concerned about AI alignment should be thinking about moving into computer security in general, and what the relationship is between computer security and AI safety.

I think it's worth distinguishing two relationships between security and alignment, or two kinds of security research. One would
be security of computer systems that interface with or are affected by AI. This is kind of the conventional computer security problem, but now in a world where AI exists, or maybe you aren't even focusing on the fact that it exists and are just thinking about conventional computer security. That's one class of problems. There's a second class of problems, which is the security of ML systems themselves: to what extent can an ML system be manipulated by an attacker, or to what extent does an ML system continue to function appropriately in an environment containing an attacker? I've got different views about those two areas.

On the first area, computer security broadly, my current feeling is that computer security is quite similar to other kinds of conflict. That is, if you live in a world where it's possible to attack someone who's running a web server, to compromise that web server or people's computers, and to effectively steal resources from them or steal time on their computers, that's very similar to living in a world where it's possible to take a gun and shoot people. In general I'd love it if there were fewer opportunities for destructive conflict in the world; it's not great if it's possible to steal stuff or blow stuff up and so on. But from that perspective, I think the core question in AI alignment is: can we build AI systems that are effectively representing human interests? If the answer is no, then there are enough forms of possible conflict that I think we're pretty screwed in any case. And if the answer is yes, if we can build powerful AI systems that are representing human interests, then I don't think cybersecurity is a fundamental problem any more than the possibility of war is a fundamental problem. It's bad, perhaps extremely bad, but we
will be in a situation where the interaction is between AI systems representing your interests and AI systems representing someone else's interests, or AI systems representing no one's interests. At that point I think the situation is probably somewhat better than the situation today; that is, I expect cybersecurity to be less of a problem in that world than in this world, if you manage to solve alignment. That's my view on conventional computer security and how alignment interfaces with it. Obviously, quantitatively, computer security can become somewhat more important during this intermediate period, for example if AI is especially good at certain kinds of attacks but ends up being not as useful for defense, and so one might want to intervene on making AI systems more useful for defense. But I think that doesn't have outsized utilitarian impact compared to other cause areas in the world.

I think security of ML systems is a somewhat different story, mostly because intervening on the security of ML systems seems like a very effective way to advance alignment to me. Ask how the alignment problem is likely to first materialize in the world: supposing that I've built some AI system that isn't doing exactly the thing that I want, the way that's likely to first show up is in the context of security. If I've built a virtual assistant that's representing my interests on the internet, it's a little bit bad if it's not exactly aligned with my interests, but in a world containing an attacker that often becomes catastrophically bad, because an attacker can take that wedge between the values of that system and my values and create situations that exploit the difference. For example, if I have an AI that doesn't care about some particular fact,
say, the fact that it uses up a little bit of network bandwidth whenever it sends a request, but I would really care about that because I wouldn't want it sending requests arbitrarily, then an attacker can create a situation where my AI becomes confused. Because it isn't attending to this cost, the attacker is motivated to create a situation where the AI will therefore pay a whole bunch of that cost: they're motivated to trick my AI, which doesn't care about sending messages, into sending a very, very large number of messages. Or, if my AI normally behaves well, there may exist a tiny class of inputs such that, with very, very small probability, it encounters an input that causes it to behave maliciously. That will perhaps appear eventually in the real world anyway; part of the alignment concern is that it will appear naturally with small enough probability, or as you run long enough. But it will definitely first appear when an attacker is trying to construct a situation in which my AI behaves poorly. So I think security has this interesting connection, where many alignment problems, not literally all but I think a majority, you should expect to appear first as security problems. As a result, I think security is one of the most natural communities to do this research in.

When you say an attacker would try to do these things, what would their motivation be?

That would depend on exactly what AI system it is. A really simple case would be a virtual assistant going out to make purchasing decisions for you. If what makes those decisions is slightly wrong, well, there are a thousand agents in the world, a thousand people who would like that virtual assistant to send them some money. So if it's possible to manipulate the decision procedure it uses for
deciding where to send money, then that's a really obvious thing to try and attack. Or suppose it's possible to cause it to leak your information. Say you have an AI which has some understanding of what your preferences are but doesn't quite understand exactly how you regard privacy: there are ways of leaking information that it doesn't consider a leak, because it has an almost but not completely correct model of what constitutes leaking. Then an attacker can use that to just take your information, by setting up a situation where the AI doesn't regard something as a leak but it is a leak. If there's any difference between what is actually bad and what your AI considers bad, then the attacker can come in and exploit that difference. If there's some action that would have a cost to you and benefit the attacker, then the attacker wants to set things up so that your AI system doesn't recognize that cost to you. So: taking money, taking information, using your computer to launch other malicious activities like denial-of-service attacks, or just causing destruction. There's some fraction of attackers who just want to run denial-of-service attacks, and if you can compromise the integrity of a bunch of AI systems people are using, that's a bummer. Or maybe they want to control what content you see. If you have AI systems mediating how you interact with the internet, where your AI says "here's something you should read", there are tons of people who would like to change what your AI suggests that you read, just because every eyeball is worth a few cents, and if you can do that at scale it's a lot of cents. Some of those problems aren't alignment problems; there are a lot of security problems that aren't alignment problems. But I think it's the case that many, many alignment problems are also security problems. So if
one were working in security of ML with an eye toward the problems that are also alignment problems, I think that's actually a pretty compelling thing to do from a long-term safety perspective.

It seems to me that AI safety is a pretty fragile area, where it would be possible to cause harm by doing subpar research, having the wrong opinions, or giving the wrong impression, such as being a big loudmouth with not terribly truth-tracking views. How high do you think the bar is for people going into this field without causing harm? Is it possible to be at the 99th or 99.9th percentile of suitability for doing this but still, on balance, not really do good, because the unintentional harm you do outweighs the positive contribution you make?

I do think the current relationship between the alignment or safety community and the ML community is a little bit strange. You have this weird situation where there's a lot of external interest in safety and alignment, a lot of external funding, and a lot of people on the street for whom it sort of sounds like a compelling concern. That causes a lot of people in machine learning to be kind of on the defensive. That is, they see a lot of external interest that's often kind of off-base and doesn't totally make sense. They're concerned about policies that don't make sense, or about a diversion of interest from issues they consider important toward some incoherent concerns. So they're a little bit on the defensive in some sense, and as a result I think it's kind of important for people in the field to be reasonably respectful and not cause trouble, because it's more likely than in most contexts to actually provoke a hostile response. I don't know if that's much of a property of people, though. I think someone who believed that this was an
important thing... If you're at the point where you're like, "Yep, I'm really concerned about causing political tension or really rocking the boat," and you're basically behaving sensibly, then I think things can probably be okay. I've definitely, from time to time, caused some distress, or run into people who were pretty antagonistic toward something I was saying. But I mostly think that if you care about it a lot and are being sensible, I'd be very surprised if the net effect was negative. I think a lot of people don't care about it very much. They would disagree with this position and say, "Look, the reason people are antagonistic is not because they're reasonably concerned that outsiders don't have a clear understanding and are pushing bad policies. The reason they're defensive is just that they're being super silly, and so it's time for a showdown between the people who are silly and the people who have sensible views." If you're coming in with that kind of perspective, then this question isn't interesting to you, because Paul's just one of the silly sympathizers. It's not clear that I'm allowed to give recommendations to people like that, or that they would be interested in them. But I would recommend, just as part of a compromise perspective: if you have that view, there exist other people, like Paul, who have a different view on which there are some reasonable concerns, and who want to behave somewhat respectfully toward those concerns. It would be good if we all compromised and just didn't destroy things or really piss people off.

So if we imagine you and your colleagues, and people similar to you in other organizations, before you
got involved in AI safety, with your skills, talents, and interests, but I told you that you can't work on AI safety, what do you think you should have done otherwise?

So if I can't work on AI safety, meaning we ignore all of the impacts of my work via its effect on safety, a natural thing that I might have done would be going into AI. AI seems important independent of alignment, for a person with a technical background. It seemed, especially in the past (this was more obviously true in the past), that there was a good ratio of the importance of the area to its congestion, or the number of people trying to work on it, and it was a good match for my comparative advantage.

Okay, let's maybe set that aside as well, because it seems pretty similar in the spirit of the question.

So, setting aside all of AI, everything that has an effect via the overall importance of AI: I'm pretty excited overall about improving human capacity to make good decisions, make good predictions, coordinate well, et cetera. I think that kind of thing would be a reasonable bet. Some of these things aren't a good fit for my comparative advantage and probably aren't what I should do. Examples of things that aren't a good fit for my comparative advantage: understanding pharmacological interventions to make people smarter, or just having a better map of the determinants of cognitive performance. How can you quickly measure cognitive performance? What actually determines how well people do at complicated, messy tasks in the real world? I think that's an area where science can add a really large amount of value; it's very, very hard for a firm to add value there, compared to a scientist,
since those are just going to be discoveries of facts, and you're not going to be able to monetize them very well. That's an example of improving human capacity in a way that I probably wouldn't have pursued, because it's not a great fit for my abilities. Things that are a better fit for my abilities are more about what sort of institutions or mechanisms you use. I don't know if I would have worked on that kind of thing, but I might have. An example of a thing I might work on is something a little bit more like law and economics. One thing that I find very interesting is the use of decision markets for collective decision-making, and that's an example of an area that I would seriously consider. I think there's a lot of very interesting stuff that you could do in that space. It's not an area I've thought about a huge amount, because it seems significantly less high-leverage than AI, but I think there's a lot of mathematical work to do there. So if you're avoiding AI and asking where math is useful: I'm almost only going to be working in some area that's very, very similar to theoretical computer science in terms of the skills it requires.

Are there other key questions in math or computer science, other than AI-related things, that stand out as being particularly important?

Definitely most of the questions people ask are, I think, if they're relevant at all, primarily relevant through their effect on AI. So I don't know how much, exactly... I mean, I took those off the table; maybe that was too much. I think a basic problem arises if you really care about differential progress, which effective altruists tend to focus on because it doesn't matter much whether we get somewhere faster; what mostly matters is the order in which technologies are developed and what trajectory we're on. I think a lot of the things people
work on in math or computer science are founded on this principle: "We don't know how this is going to be helpful, but it's going to be helpful in some way." I think that's often a valid argument, but it doesn't help with differential progress; for that you need a different flavor of argument. You want to say not just "we don't know how this is going to be helpful", but "we believe it's going to be helpful specifically to the things that matter". So I think a lot of stuff in math and computer science is less appealing from a long-run altruistic perspective because of that. I think stuff on decision-making in particular, like what kinds of institutions you use, is more appealing. I was very interested in it and did work on it. My thesis was on a giant family of problems where you have n people, each with access to some local information, who would like to make some decisions. You can formalize various problems in that space, like deciding what to produce, what to consume, and what to build, and then ask: what are good algorithms that people can use?

So it's really a computer science question. I don't know that much about these areas, but it sounds very exciting. You may not have anything to say about this one, but what would you say are the most important ways that people in the effective altruism community are approaching AI issues incorrectly?

I think one feature of the effective altruism community is path dependence, or founder effects. People in EA who are interested in AI safety are often very informed by the MIRI perspective, for the very sensible reason that MIRI and Bostrom were the earliest people talking seriously about the issue. There's a cluster of things that I would regard as errors that come with that, such as certain perspectives on how you should think about sophisticated AI systems. For example, there's
very often thinking in terms of a system that has been given a goal. This is actually not a mistake that MIRI makes; it's a mistake many EAs make. Many EAs would like to think about the AI as being handed some goal, or an explicit representation of some goal, and the question is just how we choose that explicit representation such that pursuing it leads to good outcomes. I think that's an okay model of AI to work with sometimes, but it's certainly not a super accurate model, and most of the problems in alignment don't appear in that model. So it's a kind of error that, again, I would attribute to MIRI, somewhat unfairly, since MIRI themselves wouldn't make it; it's a kind of bastardized version of their view. An analogous thing: I think the way you should probably be thinking about building AI systems is based more on the idea of corrigibility, that is, AI systems that go along with people, help people correct them, help humans understand what they're doing, and overall participate in a process that points in the right direction, rather than attempting to communicate what is actually valuable, or having a system that embodies what humans intrinsically want in the long run. That's a somewhat important distinction. Intuitively, if an ML person talks about this problem, they're really going to be thinking about it from that angle. They're going to say, "Great, we want our AI to not kill everyone, we want it to help us understand what's going on, et cetera." And sometimes EAs counter with the perspective of, "But consider the whole complexity of moral value; how would you communicate that to an AI?" I think that's an example of a mismatch that's probably mostly due to an error on the EA side, though it's certainly the case that this concept,
corrigibility, is a complicated concept, and if you actually think about the mechanics of how it works, there are a lot more moving parts than the normal ML perspective suggests. Again, it's not even really an ML perspective; it's the knee-jerk response of someone who's actually been thinking about ML systems.

I guess I also have a difference of views with EAs, often, I think, for founder-effect reasons too. Actually, no, I think for complicated reasons, they tend to have a view where the development of AI is likely to be associated with both very, very rapid change and very rapid concentration of power. I think they overestimate the probability of that happening; it's certainly a disagreement between me and most EAs. I think it's much more likely that we're going to be in a regime where AI progress is reasonably broadly distributed and AI is getting deployed a whole bunch all around the world. Maybe that happens rapidly, so that over the timescale of a year or two the world moves from something kind of comprehensible to something radically alien. But it's not likely to be a year during which something is being developed somewhere inside Google and at the end of the year rolls out and takes over the world. It's more likely to be a year during which everything is sort of in chaos, very broadly distributed chaos, as AI gets rolled out.

Is it possible that there'll be better containment of intellectual property, such that other groups can't copy it and one group does go substantially ahead? At the moment almost all AI research is published publicly, so it's relatively easy to replicate, but that may not remain the case.

I think there's definitely a naive economic perspective on which that would be incredibly surprising. Namely, in this
scenario, where AI is about to take over the world and that's driven primarily by progress in AI technology rather than by control of large amounts of hardware, the market value of that intellectual property, if you were to mark it to market, would be enormous. So you'd expect the competitive pressure to develop it to be very large, and you'd expect very large coalitions to be in the lead over small actors, who aren't at the scale where they can plausibly do it. If all of Google were involved in this project, you could imagine it becoming plausible. But then you're not imagining a small group in their basement; you're imagining an entity that was already producing on the order of a trillion dollars of value, already valued on the order of multiple trillions of dollars, putting some large share of its resources into this development project. And that's kind of conceivable. Going from five trillion dollars to a hundred trillion dollars, with a hundred trillion dollars being your value if you take over the world, is a 20x jump in value. That's a huge jump, but it's kind of in the regime of what's possible, whereas going from a billion dollars to taking over the world is just super implausible. So there's a naive economic perspective which makes that prediction very confidently. To compare that to the real world, you have to think about a lot of ways in which the world is not an idealized, simple economic system, but I still think it'll probably be the case that AI development will involve very large coalitions, with very large amounts of hardware and large numbers of researchers, regardless of whether intellectual property is contained really well. But it might take place
within a firm, or a tightly coordinated cluster of firms, rather than being distributed across the academic community. In fact, I would not be at all surprised for the academic community to play a super large role, but then the distinction is between being distributed across a large number of loosely coordinated firms versus across a network of tightly coordinated firms. In both cases it's a big group; it's not a small group working covertly. And once you're in the regime of something that big, probably what ends up happening is that the price moves. If Google's doing this, then unless they're also really tight about what they're doing, in addition to being really tight about IP, you'd see Google's share price start growing very, very rapidly in that world. And then, as that happens, you eventually start running into problems where you really can't scale markets gracefully, and policymakers probably become involved. At the point where the market is saying Google is roughly as valuable as everything else in the world, everyone is like, "Geez, this is serious."

Google's an interesting case, actually, because corporate governance at Google is pretty poor. Google has this interesting property where it's not clear that owning a share of Google would actually entitle you to anything if Google were to take over the world. Many companies have somewhat better governance than Google in this respect.

Explain that.

Google is sort of famous for shareholders having very little influence on what Google does. So if Google hypothetically had this massive windfall, it would be kind of a complicated question what Google as an organization ends up doing with it. Google seems kind of cool; I like Google, they seem nice, and probably they'd do something good with it, but it's not obvious to me that being
a shareholder in Google gives you much.

Don't you get the dividend? You could sell the shares.

Well, you get the dividend, but it's not clear whether there would be a dividend. For example, there's always the possibility of retaining the earnings to invest in other things and never paying out: build some Google City, or whatever projects. In particular, I think most of the Google shares that are traded are non-voting shares. I don't actually know very much about Google's corporate governance.

These are the famous two classes of shares?

Yeah, and I believe a majority of voting shares are still held by three individuals.

I see. So the traders don't have any formal power in the case of Google, essentially.

Yeah. The other question is the informal expectation. And again, if you're taking over the world, formal mechanisms are probably already breaking down.

There's also plenty of surplus to distribute.

Well, that depends on what you care about. In general, as AI is developed, from the perspective of humans living happy lives, there are massive amounts of surplus; people have tons of resources. But if what you care about is relative position, or owning some large share of what is ultimately out there in the universe, then in some sense there's only one universe to go around, and people will be giving it up. So I think the people who are mostly interested in living happy lives, having awesome stuff happen to them, and having their families and friends be super happy are always going to be really satisfied; it's going to be awesome. And then the remaining conflict will be amongst either people who are very greedy, in the sense that they just want as much stuff as they can have, or states that are very interested in ensuring the relative prominence of their state,
things like that. Utilitarians, I guess, are one of the offenders here, where a utilitarian wouldn't feel like, "Yeah, it's great, I got to live a happy life." Utilitarians have roughly linear returns to more resources, more so than most people do. Yeah, I guess any universalist moral system may well have this property. Not necessarily, but most of them, yeah; I think a lot of impartial values generally have that. Another blog post you wrote recently was about how valuable it would be if we created an AI that wasn't value-aligned: whether that would have any value at all, or whether it would basically mean that we've gotten zero value out of the world. Do you want to explain what your argument was? So I think this is a perspective that's reasonably common in my community and in the broader academic or intellectual world. Namely, you build some very sophisticated system. One thing you could try to do is make it just want what humans want. Another thing you could do is just say, "Great, it's a very smart system that has all kinds of complicated drives; maybe it should just do its own thing, and maybe we should be happy for it." In the same way that we think humans were an improvement over bacteria, we should think the AI we built is an improvement over humans and should live its best life. I think it's not an uncommon perspective. I think people in the alignment community are often pretty dismissive of that perspective. I think both people who intuitively accept that perspective and people who dismiss it haven't really engaged with how hard a moral question that is. I consider it extremely non-obvious. I'm not happy about the prospect of building such an AI, just because
it's kind of an irreversible decision, handing off the world to this kind of AI we built. It's a somewhat irreversible decision, and it seems unlikely to be optimal, right? Yes, I think that's it. I guess I would say, if it's half as good as humans doing their thing, I'm not super excited about that; that's just half as bad as extinction. Trying to avoid that outcome would then be half as important as trying to avoid extinction, but again, the factor of two is not going to be decisive. I think the main interesting question is: is there such an AI you can build that would be close to optimal? I do agree that a priori most things aren't going to be close to optimal; it'd be kind of surprising if that were the case. But I do think there are some kinds of AIs that are very inhuman for which it is close to optimal. It's important to understand the border between when that's very good, when, as part of being cosmic citizens, we should be happy to just build the AI, and when that's a great tragedy: whether there is some kind of AI you can build that's not aligned but is still good. In that post I made a few points. I made some arguments for why there should be some kinds of AIs that are good despite not being aligned, and I also tried to push back a little bit against the intuitive picture some people have of the default. So I guess the intuitive picture in favor is just that it's good when agents get what they want: the AI will want some things and then go about getting them, and that's also good. The alternative view would be, yes, but it might exterminate life on Earth and then fill the universe with something like paperclips, or some random thing that doesn't seem to us like it's valuable at all,
so what a complete waste that would be. Is that right? That's definitely a rough first pass, and it's basically right, but there's a lot that can be said on the topic. For example, someone who has the favorable view could say: yes, it would be possible to construct an agent which wanted a bunch of paperclips, but such an agent would be unlikely to be produced. You'd have to go out of your way; in fact, maybe the only way to produce such an agent is if you're really trying to solve alignment. If you just tried to run something like evolution, then consider the analogy to evolution: humans are so far from that kind of thing. So one position would be that there exist such bad AIs, but if you run something like evolution, you'll get a good AI instead. That perspective might then be optimistic about the trajectory of modern ML. From some alignment perspectives, this is really terrifying: we're just doing this black-box optimization, and who knows what we're going to get. From other perspectives, you're like, "Well, that's what produced humans, so you should pay it forward." I think people also get a lot of mileage out of the analogy to descendants. People say, "We would have been unhappy had our ancestors been really excited about controlling the direction of our society and tried to ensure their values were imposed on the whole future." Likewise, even if our relationship to the AI systems we build is different from the relationship of our ancestors to us, it has this structural similarity, and likewise the AI would be annoyed if we went really out of our way and paid large costs to constrain the future trajectory of civilization, only for it to be like, "You should be nice and do unto others as you would have them do unto you." I don't find that personally persuasive out of the box. Yeah, it just seems very different, because I guess
we're very similar by design to humans from 500 years ago, just with probably a lot more information and more time to think about what we want, whereas I think an AI might be so differently designed that it's a completely different draw. Well, yeah, and I'm not really compelled by it either, so I don't lean much on this; I don't take much away from the analogy to descendants. It's worth thinking about, but it's not going to carry much of the argument. I think the main reason you might end up being altruistic towards, say, the product of evolution would be if you said: from behind the veil of ignorance, humans have some complicated set of drives, etc. If humans go on controlling Earth, that set of values and preferences humans have is going to get satisfied. If we were to run some new process that's similar to evolution, it would create a different agent with a different set of values and preferences. From behind the veil of ignorance, it's just as likely that our preferences would be the preferences of the thing that actually evolved on Earth as that they would be the preferences of this AI that got created. So if you're willing to step far enough back behind the veil of ignorance, then you might say, "Okay, I guess it's 50/50." And I think there are some conditions under which you can make that argument tight, such that even a perfectly selfish causal decision theorist would, for these weird acausal trading reasons, in fact want to let the AI go ahead. And there's a question of, outside of those very extreme cases where there's a really tight argument that you should be happy, how happy you should be if there's only a loose analogy between the process you ran and biological evolution. So what do you think are the best arguments both for and against thinking
that an unaligned AI would be morally valuable? So I think it certainly depends on which kind of unaligned AI we're talking about. One question is: what are the best arguments that there exists an unaligned AI which is morally valuable? Another question is: what are the best arguments that some particular AI, like a random AI, is morally valuable? The argument for existence is an important place to get started, especially if you're starting from the dismissive perspective that most people in the alignment community sort of have intuitively; it's an important first step. I think the strongest argument for existence is: consider the hypothetical where you're actually able to create, in your computer, a nice little simulated planet from exactly the same distribution as Earth. You run evolution on it and get something very different from human evolution, but it's exactly a draw from the same distribution. Do you think it's 50/50 whether it's likely to be better or worse than us on average? Right. Well, from our values it might be much worse. Like, conditioning on our values? Yeah, then it would definitely be much worse. Yeah, so conditioning on being agnostic about what values are good. Yeah, that's right; then it's a really complicated moral philosophy question. In the extreme, we can actually make it even tighter. If you were to just make such a species and then let it go in the universe, I think you then have a very hard question about whether that's a good deal or a bad deal. I think you can do something a little bit better, something a little bit more clearly optimal: if you're able to create many such simulations, run evolution not just once but many times, look across all the resulting civilizations, and pick out a civilization which is constituted such that it's going to do
exactly the same thing you're currently doing, such that when they have a conversation like this, they're like, "Yeah, sure, let's just run evolution and let that thing prosper." Then civilizations who follow this strategy are just engaged in a musical chairs game, where each of them started off evolving on some world, then they randomly simulate a different one of them, and that one takes over their world. So you have exactly the same set of values in the universe now, just shuffled around across the people who adopt this policy. So it's clear that it's better for them to do that than it is for them to face some substantial risk of building an unaligned AI. Okay, so I didn't understand this in the post, but now I think I do. So let's imagine that there are a million universes, or different versions of Earth, or places where life has evolved. If you have a really big universe, you can imagine that they're literally all copies of exactly the same solar system, on which evolution went a little bit differently. Yeah. So they will end up with somewhat different values. And you're saying that if all of their values imply that they should just reshuffle their values, run a simulation, and then be just as happy to go with whatever that spits out as with what they seem to prefer, then all they do is kind of trade places. On average, you'll just end up with different draws from this broad distribution of possible values that people can have across this somewhat narrow but still broad set of worlds. But you're saying this is better because they don't have to worry so much about alignment? Oh, you mean why are things better after having played? Yeah, why does the musical chairs thing, where everyone just swaps values on average with other people, produce a better
outcome in total? Yeah, so I think this is most directly relevant as an answer to the question of why we should believe there exists a kind of AI that we would be as happy to build as an aligned AI, even though it's unaligned. But in terms of why it would actually be good to have done this, the natural reason is that we have some computers, and the concerning feature of our current situation is that we have all these humans, and we're concerned that the AIs running on these computers are going to be better than humans, such that we're sort of necessarily going to have to pass control over the world off to things running on these computers. And so after you've played this game of musical chairs, the new residents of our world are actually running on the computers. So now it's as if you got your good emulations for free: those people, who have access to simulations of their brains, can do with themselves whatever you would have done with your AI. Okay. Yeah, there really are a lot of moving parts here, and a lot of ways this maybe doesn't make any sense. Okay, let me just summarize. If we handed off the future to an AI that was running a simulation of these worlds, and you think that's a reference point for what we should value on average, then from this very abstracted point of view this would be no worse. And if all of the people in this broad set did this, then they would save a bunch of trouble trying to get the AI to do exactly what they want in their universe; they could all just kind of trade with one another. Well, they all get to save the overhead of trying to make the AI aligned with them specifically; instead, they have to align it to some evolutionary process they've created inside the computer, which it then listens to. And the concern is
presumably not the overhead, but rather the risk of failure. If you think there's a substantial risk that you would build a kind of AI which is not valuable, then this would be really great. So that's our current state: we might build an AI that does something no one wants, or we could instead build an AI that does something we want. A third alternative, which is roughly as good as the good outcome, is just to build an AI that reflects values drawn from the same distribution as the values we have. Okay, so you try to align it with your values, and if you fail, there's always this backup option that maybe will be valuable anyway. This is like a Plan B. Yeah, and it would mostly be relevant as a Plan B. Again, to be clear, this weird thing with evolution is not something that's going to get run, because you can't sample from the same distribution as evolution. It just prompts the question: what class of AIs have this desirable feature? If you believe at least one does, that AI would be a Plan B. So the reason to work on this moral question, which class of AIs are we happy with despite their not being aligned with us, would be that if you had any reasonable answer, it's an alternative to doing alignment. If we had a really clear answer to that question, then we could be okay anyway, even if we mess up alignment. Okay, I see, so it would be an alternative approach to getting something that's valuable even if it's not aligned with us in some narrow sense. Yeah. And it might be an easier problem to solve, perhaps? I mean, on my list of moral philosophy problems, it's my top-rated one. I think not that many people have worked on it for that long. So if you were a moral realist, if you just believed that there are objective moral facts, you should be totally fine with this kind of thing,
I'd think, at least from their perspective. Well, unless they think that humans are better at discovering objective moral facts than... Yeah, I don't know moral realist positions very well. Well, let's go with that. But I guess they might look at humans and say, "Well, I do just think that we've done better than average, better than you would expect, at this. For example, we care about this problem to begin with, whereas many other agents might not even have the concept of morality. So in that sense we were in the top half, maybe not at the very top, so I wouldn't roll the dice completely again." But then it seems like they should also think that, if we did okay, there's a decent chance that if you roll the dice again, you get something somewhat valuable, because it would be an extraordinary coincidence if we managed to do really quite well at moral realism, at figuring out what these moral facts are, if it was extremely improbable for that to happen to begin with. Yeah, if you're a moral realist, you'd have different views on this question, and it's going to depend a lot on what views you take on a bunch of related questions. I'm not super familiar with coherent moral realist perspectives, but on my kind of perspective, if you make some moral errors early in history, it's not a big deal as long as you're on the right path of reflection, or deliberation. Yeah, so you might think that from the realist perspective there'd be a big range of acceptable outcomes, and you could in fact be quite a bit worse than humans as long as you were on the right path of reflection and deliberation. Yeah, I don't quite know how moral realists feel about deliberation; would they say that... [inaudible]. Yeah, I think
there's probably a lot of disagreement amongst realists. Yeah, but then if you're just a total subjectivist, so you think there's nothing that people ought to think is right and instead you just want what you want, why would you care at all about what other people in different hypothetical runs of evolution would care about? Wouldn't you just be like, "I don't even care what you want; all I care about is what I individually want, and I just want to maximize that"? Yeah, so then you get into decision-theoretic reasons to behave kindly. The simplest pass would be from behind the veil: if you could have made a commitment, before learning your values, to act in a certain way, and that would have benefited your values in expectation, then it might make sense to act that way. Similarly, if there are logical correlations between your decisions and the decisions of others with different values, then even on your values it might be correct for you to make this decision, because it correlates with other decisions made by others. In the most extreme case... At some point I should caveat this entire last part of the discussion. Oh yeah, this is a bunch of weird, abnormal stuff. Anyway, then it gets weird: once you're doing this musical chairs game, one step of it was that you ran a bunch of simulations and saw which ones would also participate in the scheme you're currently running. So from that perspective, we as humans might as well be in such a simulation, in which case, even on our narrow values, by running the scheme we're going to be the ones chosen to take over the outside world. Why are you more likely to be chosen to go to the outside world if you're cooperative? Ah, so the scheme you would run if you wanted to use this musical chairs thing: you can't just simulate a random species and let it
take your place, because then the species that run this procedure are all going to give up their seats to random species and be replaced. So it's an evolutionarily bad strategy; that's not the strategy. The thing that might be an okay strategy is that you run the scheme, and then, before you let each species replace you, you test whether they also ran the scheme, and you choose from the cooperative ones. Yeah, and then that would set up the incentives to be cooperative. Yeah, I think this does get a bit weird once we're talking about simulations; I think the earlier parts were more normal. Yeah, the question of just whether an AI would be morally valuable seems much, much more mainstream. Yeah, and also, more importantly, I think this weird stuff with simulations doesn't matter much, whereas the question of how morally valuable it is to have an AI whose values are drawn from a similar distribution to ours is actually pretty important. I think it's relatively common for people to think that would be valuable, and it's not a position people have engaged with that much. To my knowledge, it's not a question moral philosophers have engaged with much; a little bit, but I guess they come from different perspectives than the ones I would want to attack the question from. Another point is that I'm also kind of scared of this entire topic. I think a reasonably likely way that AI being unaligned ends up looking in practice is that people build a bunch of AI systems that are extremely persuasive and personable, because they can be optimized effectively to have whatever superficial properties you want. So you live in a world with just a ton of AI systems that want random garbage, but they look really sympathetic and they're
making really great pleas: "This is incredibly inhumane, killing us, or deleting us, or imposing your values on us." And then I expect, given the current way the intellectual consensus goes, that we're likely to be much more concerned about people being bigoted towards AI systems or failing to respect their rights than about the actual character of those systems. I think that's a pretty likely failure mode that I'm concerned about. Interesting, I hadn't really thought about that scenario. So the idea is that we create a bunch of AIs, and then we have an AI justice movement that gives AIs more control of the world and more moral consideration, and then it turns out that while they're very persuasive at advocating for their moral interests, their moral interests, once they're given more autonomy, are in fact nothing like ours, or much less like ours than they seem. Then we're back to this unclear question of how valuable that is. Maybe that's fine; I don't actually have a super strong view on that question. I think in expectation I'm not super happy about it. But by advocating for the moral rights of AIs, are you making this scenario more likely? Well, I strongly suspect there's going to be serious discussion about this in any case, and I would prefer that there be some actual figuring out of what the correct answer is before it becomes an emotionally charged or politically charged issue. To be clear, I'm also not super confident about anything we're saying here; these are not 80% views, these are like 40% views. Yeah. An example would be: often when we talk about failure scenarios, I will talk about a bunch of automated corporations, autonomous corporations, that control resources, and they're
amassing resources that no human gets to use for any purpose, and people's response is, "That's absurd; we would just say, look, legally you're just a machine, you have no right to own things, we're going to take your stuff." That's something I don't think is that likely to happen. I suspect that, to the extent lots of resources are controlled by AI systems, those AI systems, in the interest of preserving those resources, will make fairly compelling appeals for respecting their rights, in the same way that a human would. If all the humans got together and said, "Yeah, we're just going to take it," that's just such terrible optics, and it seems so much not a thing that I expect our society to do: everyone just saying, "We're going to take all these actors' resources; we just don't think they have the right to self-determination." Interesting. That seems like the default to me, but maybe not. I guess the issue is that the AIs would advocate for themselves without human assistance, potentially in a way that a corporation can't. A corporation is still made of people. Do corporations make an argument that "I'm a separate entity, I deserve rights, and I should be able to amass resources that don't go to shareholders"? The thing is, a corporation is controlled by shareholders, so it all bottoms out in people in some way. Yes, and an AI doesn't necessarily. I think it's both the case that corporations do in fact have a level of rights that would be sufficient to run the risk argument, so that if the outcome were the same as with corporations, that would be sufficient to be concerned, and also that corporations do bottom out in people in a way that these entities wouldn't, which is one of the main problems. Also, corporations just aren't able to make persuasive arguments; they're not able to represent themselves well,
in that they don't have the ability to articulate eloquent arguments that plausibly originate with an actual moral patient making them. Yeah. And the moral case is more straightforward for corporations, whereas I think for AIs there will actually be a huge amount of ambiguity. And I think the default, if you interact with people who think about these issues right now, if you talk to random academics who think about philosophy and AI, or you look at Hollywood movies that are somewhat less horrifying than Terminator, the normal theme would be: by default, we expect that once agents are as sophisticated as humans, they are deserving of moral consideration for the same kinds of reasons humans are, and it's reasonably likely that people will deny them that moral consideration, but that would be a terrible moral mistake. That's not exactly a normal view yet, but if I were to try to guess where opinion is heading, or where it'll end up, that'd be my guess. Yeah, I guess I feel like the AIs probably would deserve moral consideration. Sure, yes, I suspect that's true. But then there's the question of what that moral consideration covers. I suppose, because I'm sympathetic to hedonism, I care about their welfare. As do I. Yeah, as we all should. But I don't necessarily then want them to be able to do whatever they want with the resources. I don't know if that's a natural position, but I feel that way about other people as well, and this is similar, right? I want other people on Earth to have high levels of welfare, but that doesn't necessarily mean I want to hand over the universe to whatever they want. I just think it makes this a lot
more contentious if everyone agrees that there's this giant class of individuals, potentially reasonably large, which currently does some large fraction of the labor in the world, which is asking for the right to self-determination and control of property and so on, and which is also way more eloquent than we are. Yes. And we say, "We'll give you the welfare that we think you should have." Yeah, good. Okay. And I think the main reason I think it's plausible is that we do observe this kind of thing with non-human animals: people are pretty happy to be pretty terrible to non-human animals. But that's not quite the case I mean. For example, I think that we should be concerned about the welfare of pigs and make pigs' lives good, but I wouldn't then give pigs lots of GDP to organize in whatever way pigs want. I suppose the disanalogy there is that we think we're more intelligent and have better values than pigs, whereas it's less clear that would be true with AI. But inasmuch as I worry that the AI wouldn't have good values, it actually is quite analogous. Yeah, I mean, I think the arguments you're willing to make here are somewhat unusual amongst EAs, probably. I think most of them do have more of a tight coupling between moral concern and thinking the other thing deserves liberty, self-determination, and stuff like that. Right. Do you think they're bad arguments? I suppose it flows more naturally from a hedonistic point of view than from a preference utilitarian point of view; that seems to be maybe where we're coming apart. Oh, no. I mean, I also care about the welfare of lots of agents. I believe it's a terrible, bad thing, though maybe not the worst thing ever, if you're mistreating a bunch of AI systems. Yes, I think they probably are, at some point, going to be moral patients. Yeah, but
I would totally agree with you, though: I could have that position and simultaneously believe that it was either a terrible moral error to bring such beings into existence or a terrible moral error to give them great authority over what happens in the world. Yeah, I think that's the likely place for us to end up. But I feel like the level of rigor and carefulness in public discussion isn't such that those kinds of things get pulled apart; it'll probably mostly collapse into a general "hurrah" or "boo." I don't know that much about how any of this works, but I'd be happy to take simple bets on this. Well, there are some selfish reasons why people would not necessarily want to give AIs a bunch of GDP. You could imagine groups that would say, "Well, we still want to own AIs, but we should treat them humanely." I guess that doesn't sound too good; it's not going to play well. Yeah. Also, there's just such a strong concentrated interest. Most of the cases where this goes badly are cases where there's a large power imbalance, but in the case we're talking about, the most effective lobbyists will be AIs. Yes. It's going to be this very concentrated, powerful interest which cares a lot about this issue, has a plausible claim, and looks really appealing. It seems kind of overdetermined, basically. Yeah, okay, that sounds super important. This is mostly relevant, again, when people say things like, "No, it's kind of crazy to imagine AIs owning resources," and I think that is the default outcome, barring some sort of surprising developments. And I've barely thought about this issue at all, to be honest. So perhaps I need to think about it some more, and then maybe we could talk about it again. I don't think it's that important an issue, mostly; I think details of how to make alignment work and such are more important. I would just try to justify
them by the additional argument that, to the extent you care about what these AI systems want, you really would like to create AI systems that are on the same page as humans. If you get to create a whole bunch of extra new agents, it could be great if you create agents whose preferences are well aligned with the existing agents, whereas you can just create a ton of unnecessary conflict and suffering if you create new agents who want very different things. Okay, so we're almost out of time, but just a final few questions. You're not only working in this area; you're also a donor, and you're trying to support projects that you think will contribute to AI alignment. But it's an area where there are a lot of people trying to do that; there's perhaps more money than people who can usefully take it, so I hear it's somewhat challenging to find really useful things to fund that aren't already getting funded. How do you figure out what to fund, and would you mind mentioning some of the things that you donate to now? Yeah, so I think I would like to move towards a world where it's easier to work on this, where anyone who is equipped to do reasonable alignment work is able to do that with minimal hassle, including if they have differences of view with other organizations currently working in the space, or if they're not yet trained up and want to just take some time to think about the area and see if it works out well. I think there are definitely people who are doing work, who are interested in funding, and who are, I'd say, not doing crazy stuff. So in order to inject more money, one could just dip lower in that direction and say, look, if we're not really restricted by funding, then our bar ought not to be that we're really convinced this thing is great; our bar should just be that it looks like you're a sensible person who might
eventually figure out what's important; it's part of personal growth; maybe this project winds up being good for reasons we don't understand. So one can certainly dip more in that direction; I think that's not all used up. Yeah, the stuff I've funded in AI safety over the last year: the biggest thing was funding us, and the next biggest was running this sort of open call for individuals working on alignment outside of any organization, which has funded three or four people. I guess most recently it's a group working on IRL under weaker rationality assumptions in Europe. And then also supporting Zvi Mowshowitz and Vladimir Slepnev running an AI alignment prize, which I'm funding and am a little bit involved in judging.

Do you think the donors that are earning to give could find similarly promising projects if they looked around actively?

I think it's currently pretty hard in alignment. I think there's potentially room for... right, so I think it's conceivable that existing funders, including me, are being too conservative in some respects, and you could just say, look, I really don't know if X is good, but there is a plausible story where this thing is good; or ensuring that many people in the field had enough money that they could regrant effectively, maybe even the conventional AI safety crowd, and could do whatever they wanted. Yeah, unless you're willing to get a little bit crazy, it's pretty hard.

I guess it also depends on what your...

Yeah, I think it depends on what your bar is. I think if we're in short timelines, then AI interventions are still probably pretty good compared to other opportunities, and there might be some qualitative sense that this kind of feels like a longer shot or a wackier thing to fund. In most areas, I think a donor probably has to be somewhat
comfortable with that. Yeah, there's also some claims... I haven't... I think MIRI can always use more money. I think there are some other organizations that can also use more money, and it's not something they think about that much. In general, giving is not something I've been thinking about that much, because it seems much better for me to personally be working on getting stuff done.

Yeah, that sounds right. Well, this has been incredibly informative, and you're so prolific that I've got a whole lot more questions, but we'll have to save them for another episode in future. I'll stick up links to some other things you've written that I think listeners who have stuck with the conversation this far will be really interested to read. And yeah, you do write up a lot of your ideas in detail on your various blogs, so listeners who would like to learn more definitely have the opportunity to do so. Just one final question. Speaking of the blogs, about a week ago you wrote about eight unusual science fiction plots that you wish someone would turn into a proper book or a movie, and I guess they're very hard science fiction things that you think actually might happen and that we could learn from. So what do you think's wrong with current sci-fi, and which was your favorite of the ideas that you wrote up?

So I think a problem that I have, and that maybe many similar people have, is that it becomes difficult to enjoy science fiction as the world becomes less and less internally coherent and plausible. At the point when you're really trying to imagine what this world is like, what this character is actually thinking, what their background would be, almost all the time, if you try to do that with existing science fiction, if you think too hard, eventually the entire thing falls apart and it becomes very difficult, like
you kind of have to do a weird exercise in order to not think too hard if you want to really sympathize with any of the characters, or really even think about what's going on in the plot. I think it's extremely common; it's very, very rare to have any science fiction that doesn't have that problem. I think it's kind of a shame, because it feels to me like the actual world we live in is super weird, and there's lots of super crazy stuff that, I don't know if it will happen, but it's certainly internally consistent that it could happen, and I would really enjoy science fiction that just fleshed out all the crazy stuff that could happen. I think it's a little bit more work, and the basic problem is that most readers just don't care at all, or it's incredibly rare for people to care much about the internal coherence of the world, so people aren't willing to spend extra time or slightly compromise on how convenient things are narratively. I would guess the most amusing story from the ones I listed, or the one that would actually make the best fiction, would be one plot I described that was set in Robin Hanson's Age of Em scenario, which I think, even if it doesn't fill in all the details, is a pretty coherent scenario. This is where you have a bunch of simulated humans who have mostly replaced normal humans in the workforce, alive during this brief period of maybe a few calendar years as we transition from simulated human brains to much more sophisticated AI. In that world, the experience of an em is very, very weird in a number of ways, one of which is that you can put an em in a simulation of an arbitrary situation; you can copy them and you can reset them; you can run an em a thousand times through a situation, which I think is a really interesting situation to end up in. So I described a plot that's sort of like
a... yeah, I think if you consider the genre of con movies, I quite enjoy that genre. I think it would be a really, really interesting genre in this setting, where it's possible to take a person, copy their brain, and put them in simulations, and where people actually have legitimate interest in wondering not only what this person would do in a simulation, but what this person would do in a simulation when they're simulating someone else. It's an incredibly complicated dynamic, that situation, and also very conducive, I think, to amusing plots. So I'd be pretty excited to read that fiction. It'll probably be most amusing as a film. I don't think it's ever gonna happen; I think none of them will happen.

That's great. Maybe after the singularity we'll be so rich we'll be able to make all kinds of science fiction that appeals just to a handful of people. It'll be super awesome.

Yeah, once AI is really powerful, AI can write for us. We can each have a single AI just producing films for one individual.

Oh, thousands a day. [unclear] It's gonna be a super dream. Thanks so much for taking the time to come on the podcast, Paul, and also, just in general, thanks so much for all the work that you're putting in to try to make the world a better place, or at least the future a better place.

Thanks again for having me, and thanks for running the podcast.

If you liked this episode, can I suggest going back and listening to our two previous episodes on AI technical research: that's episode three with Dr.
Dario Amodei on OpenAI and how AI will change the world for good and ill, and episode 23, how to actually become an AI alignment researcher, according to Jan Leike. Then you can go on and listen to our two episodes on AI policy and strategy: that's number 31, Professor Allan Dafoe on defusing the political and economic risks posed by existing AI capabilities, and episode one, Miles Brundage on the world's desperate need for AI strategists and policy experts. And if those four episodes aren't enough for you, there's episode 21, Holden Karnofsky on times philanthropy transformed the world and Open Phil's plan to do the same, which has a significant section on the Open Philanthropy Project's plan to positively shape the development of transformative AI. And again, if you know someone who would be curious about these topics or already works adjacent to them, please do let them know about the show. That's how we find our most avid listeners and can most contribute to making the world a better place. The 80,000 Hours Podcast is produced by Keiran Harris. Thanks for joining; talk to you in a week or two.
Related conversations
AXRP
28 Mar 2025
Jason Gross on Compact Proofs and Interpretability

This conversation with Jason Gross on compact proofs and interpretability examines technical alignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum trail (transcript)
Med 0 · avg -1 · 139 segs
AXRP
1 Mar 2025
David Duvenaud on Sabotage Evaluations and the Post-AGI Future

This conversation with David Duvenaud on sabotage evaluations and the post-AGI future examines technical alignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum trail (transcript)
Med -9 · avg -7 · 21 segs
AXRP
1 Dec 2024
Evan Hubinger on Model Organisms of Misalignment

This conversation with Evan Hubinger on model organisms of misalignment examines technical alignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum trail (transcript)
Med -6 · avg -7 · 120 segs
AXRP
27 Jul 2023
Superalignment with Jan Leike

This conversation with Jan Leike on superalignment examines technical alignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum trail (transcript)
Med -10 · avg -7 · 112 segs
Counterbalance on this topic
Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.
Mirror pick 1
AXRP
3 Jan 2026
David Rein on METR Time Horizons

This conversation with David Rein on METR time horizons examines core safety, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -14.44 · This pick -10.64 · Δ +3.80
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -0 · 108 segs
Mirror pick 2
AXRP
7 Aug 2025
Tom Davidson on AI-enabled Coups

This conversation with Tom Davidson on AI-enabled coups examines core safety, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -14.44 · This pick -10.64 · Δ +3.80
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -5 · 133 segs
Mirror pick 3
AXRP
6 Jul 2025
Samuel Albanie on DeepMind's AGI Safety Approach

This conversation with Samuel Albanie on DeepMind's AGI safety approach examines core safety, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -14.44 · This pick -10.64 · Δ +3.80
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -4 · 72 segs