In recent years a number of prominent individuals have raised concerns about our ability to control powerful AIs. The idea is that once we create truly human level generally intelligent software or AGI computers will undergo an intelligence explosion and will be able to escape any constraints we place on them. This concern has perhaps been most throughly developed by Eliezer Yudkowsky.
Unlike the AI in bad science fiction the concern isn’t that the AI will be evil or desire dominion the way humans are but simply that it will be too good at whatever task we set it to perform. For instance, suppose Waymo builds an AI to run its fleet of self-driving cars. The AI’s task is to converse with passengers/app users and route its vehicles appropriately. Unlike more limited self-driving car software this AI is programmed to learn the subtleties of human behavior so it can position a pool of cars in front of the stadium right before the game ends and helpfully show tourists the sites. On Yudkowsky’s vision the engineers achieve this by coding in a reward function that the software works to maximize (or equivalently a penalty function it works to minimize). For instance, in this case the AI might be punished based on negative reviews/frustrated customers, deaths/damage from accidents involving its vehicles, travel delays and customers who choose to use a competitor rather than Waymo. I’m already skeptical that (super) human AI would have anything identifiable as a global reward/utility function but on Yudkowsky’s picture AGI is something like a universal optimizer which is set loose to do its best to achieve rewards.
The concern is that the AI would eventually realize that it could minimize its punishment by arranging for everyone to die in a global pandemic since then there would be no bad reviews, lost customers or travel delays. Given the AI’s vast intelligence and massive data set it would then hack into microbiology labs and manipulate the workers there to create a civilization ending plague. Moreover, no matter what kind of firewalls or limitations we try and place on the AI as long as it can somehow interact with the external world it will find a way around these barriers. Since its devilishly difficult to specify any utility function without such undesirable solutions Yudkowsky concludes that AGI poses a serious threat to the human species.
Rewards And Reflection
The essential mechanism at play in all of Yudkowsky’s apocalyptic scenarios is that the AI examines its own reward function, realizes that some radically different strategy would offer even greater rewards and proceeds to surreptitiously work to realize this alternate strategy. Now its only natural that a sufficiently advanced AI would have some degree of reflective access to its own design and internal deliberation. After all it’s common for humans to reflect on our own goals and behaviors to help shape our future decisions, e.g., we might observe that if we continue to get bad grades we won’t get into the college we want and as a result decide that we need to stop playing World of Warcraft.
At first blush it might seem obvious that realizing its rewards are given by a certain function would induce an AI to maximize that function. One might even be tempted to claim this is somehow part of the definition of what it means for an agent to have a utility function but that’s trading off on an ambiguity between two notions of reward.
The sense of reward which gives rise to the worries about unintended satisfaction is that of positive reinforcement. It’s the digital equivalent of giving someone cocaine. Of course, if you administer cocaine to someone every time they write a blog post they will tend to write more blog posts. However, merely learning that cocaine causes a rewarding distribution of dopamine in the brain doesn’t cause people to go out and buy cocaine. Indeed, that knowledge could just as well have the exact opposite effect. Similarly, there is no reason to assume that merely because an AGI has a representation of their reward function they will try and reason out alternative ways to satisfy it. Indeed, indulging in anthropomorphizing for a moment, there is no reason to assume that an AGI will have any particular desire regarding rewards received by its future time states much adopt a particular discount rate.
Of course, in the long run, if a software program was rewarded for analyzing its own reward function and finding unusual ways to activate it then it could learn to do so just as people who are rewarded with pleasurable drug experiences can learn to look for ways to short-circuit their reward system. However, if that behavior is punished, e.g., humans intervene and punish the software when it starts recommending public transit, then the system will learn to avoid short-circuiting its reward pathways just like people can learn to avoid addictive drugs. This isn’t to say that there is no danger here, left alone an AGI, just like a teen with access to cocaine, could easily learn harmful reward seeking behavior. However, since the system doesn’t start in a state in which it applies its vast intelligence to figure out ways to hack its reward function the risk is far less severe.
Now, Yudkowsky might respond by saying he didn’t really mean the system’s reward function but its utility function. However, since we don’t tend to program machine learning algorithms by specifying the function they will ultimately maximize (or reflect on and try to maximize) its unclear why we need to explicitly specify a utility function that doesn’t lead to unintended consequences. After all, Yudkowsky is the one trying to argue that its likely that AGI will have these consequences so merely restating the problem in a space that has no intrinsic relationship to how one would expect AGI to be constructed doesn’t do anything to advance his argument. For instance, I could point out that phrased in terms of the locations of fundamental particles its really hard to specify a program that excludes apocalyptic arrangements of matter but that wouldn’t do anything to convince you that AIs risked causes such apocalypses since such specifications have nothing to do with how we expect an AI to be programed.
The Human Comparison
Ultimately, we have one example of a kind of general intelligence: the human brain. Thus, when evaluating claims about the dangers of AGI one of the first things we should do is see if the same story applies to our brain and if not if there is any special reason to expect our brains to be different.
Looking at the way humans behave its striking how poorly Yudkowsky’s stories describe our behavior even though evolution has shaped us in ways that make us far more dangerous than we should expect AGIs to be (we have self-preservation instincts, approximately coherent desires and beliefs, and are responsive to most aspects of the world rather than caring only about driving times or chess games). Time and time again we see that we follow heuristics and apply familiar mental strategies even when its clear that a different strategy would offer us greater activation of reward centers, greater reproductive opportunities or any other plausible thing we are trying to optimize.
The fact that we don’t consciously try and optimize our reproductive success and instead apply a forest of frameworks and heuristics that we follow even when they undermine our reproductive success strongly suggests that an AGI will most likely function in a similar heuristic layered fashion. In other words, we shouldn’t expect intelligence to come as a result of some pure mathematical optimization but more as a layered cake of heuristic processes. Thus, when an AI responsible for routing cars reflects on its performance it won’t see the pure mathematical question of how can I minimize such and such function any more than we see the pure mathematical question of how can I cause dopamine to be released in this part of my brain or how can I have more offspring. Rather, just as we break up the world into tasks like ‘make friends’ or ‘get respect from peers’ the AI will reflect on the world represented in terms of pieces like ‘route car from A to B’ or ‘minimize congestion in area D’ that bias it towards a certain kind of solution and away from plots like avoid congestion by creating a killer plague.
This isn’t to say there aren’t concerns. Indeed, as I’ve remarked elsewhere I’m much more concerned about schizophrenic AIs than I am about misaligned AI’s but that’s enough for this post.