AI beyond alignment

Even if we solve intent alignment and build AI systems that are trying to do what their deployers want them to do, plenty of issues remain to be addressed if we are to successfully navigate the transition to a world with advanced AI, as an increasing number of people are pointing out.

Part of the problem is that “AI alignment shouldn’t be conflated with AI moral achievement,” as Matthew Barnett explains:

if we succeed at figuring out how to make AIs pursue our intended goals, these AIs will likely be used to maximize the economic consumption of existing humans at the time of alignment. And most economic consumption is aimed at satisfying selfish desires, rather than what we’d normally consider our altruistic moral ideals.

Solving AI alignment does not once and for all solve the problem of how multiple autonomous individuals with partially conflicting interests should cooperate and coexist, a problem that will likely always be with us. But the transition to a world with advanced AI does represent a crucial time where many fundamental parameters of this implicit agreement may need to be renegotiated. Doing so requires a wide range of work, going far beyond issues of technical alignment.

Nick Bostrom discusses some of the relevant issues in his two papers on digital minds, and I imagine his upcoming book will focus on how to live in a post-AI world.

Holden Karnofsky explicitly highlights that transformative AI issues are not just misalignment, listing the following further problems:

  • Power imbalances. As AI speeds up science and technology, it could cause some country/countries/coalitions to become enormously powerful - so it matters a lot which one(s) lead the way on transformative AI. (I fear that this concern is generally overrated compared to misaligned AI, but it is still very important.) There could also be dangers in overly widespread (as opposed to concentrated) AI deployment.
  • Early applications of AI. It might be that what early AIs are used for durably affects how things go in the long run - for example, whether early AI systems are used for education and truth-seeking, rather than manipulative persuasion and/or entrenching what we already believe. We might be able to affect which uses are predominant early on.
  • New life forms. Advanced AI could lead to new forms of intelligent life, such as AI systems themselves and/or digital people. Many of the frameworks we’re used to, for ethics and the law, could end up needing quite a bit of rethinking for new kinds of entities (for example, should we allow people to make as many copies as they want of entities that will predictably vote in certain ways?) Early decisions about these kinds of questions could have long-lasting effects.
  • Persistent policies and norms. Perhaps we ought to be identifying particularly important policies, norms, etc. that seem likely to be durable even through rapid technological advancement, and try to improve these as much as possible before transformative AI is developed. (These could include things like a better social safety net suited to high, sustained unemployment rates; better regulations aimed at avoiding bias; etc.)
  • Speed of development. Maybe human society just isn’t likely to adapt well to rapid, radical advances in science and technology, and finding a way to limit the pace of advances would be good.

Of course, AI governance is already a thriving research field, and many of the issues are within its scope. (Have some examples here) Yet many of the issues run deeper, and may require us to reconceive fundamental notions of political philosophy.

GPI’s new research agenda on risks and opportunities from artificial intelligence covers some relevant topics, e.g. political philosophy:

Some of the risks posed by AI are political in nature, including the risks posed by AI-enabled dictatorships. Other risks will inevitably involve a political dimension, for example with regulation and international agreements playing an important role in enabling or mitigating risks. For this reason, it’s likely that political philosophy will be able to provide insight. Questions we’re interested in include: Should AI development be left in the hands of private companies? How if at all should our political and economic institutions change if we one day share the world with digital moral patients or agents? Will AI exacerbate and entrench inequalities of wealth and power? Will AI cause mass unemployment? Will AI increase the risk of war between great powers? In each of these cases, how severe is the threat, what can be done to mitigate it, and what are the relevant trade-offs?


Lock-in:

GPI is interested in work that clarifies the nature of lock-in and the relationship between lock-in and the achievement of a desirable future. We’re also interested in work that explores whether AI is likely to bring about various types of lock-in (Karnofsky, 2021; Finnveden et al., 2022). One important-seeming type is value lock-in (MacAskill, 2022, Chapter 4): the values instantiated by advanced AI could persist for a very long time. That suggests that it is especially important to get these values right. Unfortunately, there are also many ways in which we might get these values wrong. We might endow powerful AI with the wrong theory of normative ethics, or the wrong theory of welfare, or the wrong axiology, or the wrong population ethics, or the wrong decision theory, or the wrong theory of infinite ethics. Each of these mistakes could make the future significantly worse than it otherwise would be. With what values - if any - should we endow AI?

Navigating rapid change:

As noted above, AI might lead to rapid societal and technological change. What can we do ahead of time to mitigate the risks and realise the opportunities? One idea is ensuring that powerful actors agree ahead of time to coordinate in certain ways. For example, actors might agree to share the benefits of AI and to refrain from taking actions that might be irreversible, like settling space and developing dangerous technologies. What sort of agreements would be best? Could humanity bring about and enforce agreements of this kind?

Digital minds:

Various issues at the intersection of value theory and the philosophy of mind might be relevant to determining whether AI counts as a moral patient and how we ought to treat AI systems if so. This might include work exploring the nature of consciousness and sentience, work exploring which mental properties are relevant to moral status, and work exploring the nature of wellbeing.

Digital minds might raise unique challenges for political philosophy. For example, digital minds might be able to duplicate themselves with relative ease, which might raise challenges for integrating them into democratic systems. How if at all should our political systems change in that case?

Many of Samuel Hammond’s recent blog posts are also in this vein, and Lukas Finnveden has a recent series of posts on non-alignment project ideas for making transformative AI go well, covering governance during explosive technological growth, epistemics, sentience and rights of digital minds, backup plans for misaligned AI, and cooperative AI.

This is the general sort of area I hope to be working in. Now I just need to figure out some specific projects to dig into.

February 5, 2024