Thoughts on rationalism and the rationalist community from a skeptical perspective. The author rejects rationality in the sense that he believes it isn't a logically coherent concept, that the larger rationalism community is insufficiently critical of its beliefs and that ELIEZER YUDKOWSKY IS NOT THE TRUE CALIF.

Decision Theory Anti-realism

There Is No Fact Of The Matter About Correct Decision Theory

With the recent flurry of posts in the rationalist community about which decision theory (e.g. CDT, EDT, UDT etc.) is best, it’s time to revisit the theme of this blog: rejecting rationality realism. In this case that means pointing out that there isn’t actually a well-defined fact of the matter about which decision theory is better. Of course, nothing stops us from arguing with each other about the best decision theory, but those disagreements are more like debates about what’s the best programming language than disagreements about the chemical structure of benzene.

Any attempt to compare decision theories must first address the question: what does it mean for one decision theory to be better than another? Unlike many pseudo-problems[1] there is a seemingly meaningful answer to this question: one decision theory is better than another to the extent that the choices it recommends lead to better outcomes for the agent. Other than some ambiguity about which theory is better if neither dominates the other, it seems like this gives a straightforward criterion for superiority: we just look at actual outcomes and see which decision theory offers the best results for an agent. However, this only appears to give a well-defined criterion because in everyday life the subtle differences between the various ways to understand a choice and how to conceptualize making a choice don’t matter.

In particular, the kind of scenarios which distinguish between the various decision theories yield different answers depending on whether you want to know who you should be (i.e. total source code) to do best, how you should program an agent if you want it to do best, which decision rule you should adopt to do best, or what choice gives you the best outcome. Furthermore, these scenarios call into question whether the supposed ‘choices’ made by the decision theory relate to our intuitive notion of choice in a way that makes them relevant to some notion of good decision making, or whether they simply demand that the laws of physics/logic give way to offer a better outcome in a way that has nothing to do with actual decisions.

Intuitions and Motivation

I’m sure some readers are shaking their heads at this point and saying something like

I don’t need to worry about technical issues about how to understand a choice. I can easily walk through Newcomb style problems and the rules straightforwardly tell me who gets what which is enough to satisfy my intuitive notion that theory X is better. Demanding one specify all these details is nitpicking.

To convince you that’s not enough, let me provide an extreme dramatization of how purported payouts can be misleading and how the answer turns on a precise specification of the question. Consider the following Newtonian, rather than Newcombian, problem. You fall off the top of the Empire State Building. What do you do as you fall past the fifth floor? What would one say about the virtues of Floating Decision Theory, which tells us that in such a situation we should make the choice to float gently to the ground? Now obviously, one would prefer to float rather than fall, but posing the problem as a decision between these two choices doesn’t render it a real choice. Obviously, there is something dubious about evaluating your decision theory based on its performance on the float/fall question. At least on one conception, a decision theory is no worse for failing to recommend that the agent do something impossible for them, so we can’t blindly assume that any time we are handed a set of ‘choices’ and told what their payoffs are we can simply take those at face value.

Yet, this is precisely the situation we encounter in the original Newcomb problem, as the very assumption of predictability which allows the demon[2] to favor the one-boxers ensures the physical impossibility of choosing any number of boxes other than what you did choose. Of course, the same is (up to quantum mechanical randomness) true of any actual ‘choice’ by a real person, but under certain circumstances we find it useful to idealize it as free choice. What’s different about the Newcomb problem is that, understood naively, it simultaneously asks us to idealize selecting 1 or 2 boxes as a free choice while assuming it isn’t actually one. Thus, it’s reasonable to worry that our intuitions about choices can’t just be applied uncritically in Newcomb type problems. Next I hope to motivate the concern that there might be multiple ways to understand the question being asked.

Let’s now modify this situation by imagining that we actually live in the Marvel Universe, so there are a number of people (floaters) who respond to large falls by, moments before impact, suddenly decelerating and floating gently to the ground. Now suppose we pose the question of whether, as you fall past the 5th floor, you should choose to have been born a floater or not. Obviously, this question suffers from the same infirmities as the above example in that intuitively there is no ‘choice’ involved in being a floater or not. However, we can mask this flaw by phrasing the choice not as one between being a floater and not being one, but as one between yelling, “Holy shit I’m a floater” and concentrating totally on desperately trying to orient yourself so your feet strike first. Now presuming there is a strong (even exceptionless) psychological regularity that only floaters take the first option, it follows that EDT recommends making such a yell while CDT doesn’t.
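To make the divergence concrete, here is a minimal sketch of how the two theories score the two actions. Every number in it is hypothetical (the prior chance of being a floater and all of the utilities), and it assumes the regularity holds in both directions, i.e. all floaters yell and only floaters yell. It is meant as an illustration of the textbook EDT and CDT calculations under those assumptions, nothing more.

```python
# Hypothetical numbers throughout; the post itself gives none.
P_FLOATER = 0.01           # assumed prior probability of being a floater
U_FLOAT = 100.0            # assumed utility of floating gently to the ground
U_CRASH_FEET_FIRST = 1.0   # assumed slim benefit of landing feet first
U_CRASH_YELLING = 0.0      # assumed utility of hitting the ground mid-yell

ACTIONS = ("yell", "orient")


def utility(is_floater: bool, action: str) -> float:
    """Outcome utility: floaters are fine whatever they do."""
    if is_floater:
        return U_FLOAT
    return U_CRASH_FEET_FIRST if action == "orient" else U_CRASH_YELLING


def edt_value(action: str) -> float:
    """EDT treats the action as evidence about what kind of person you are.

    Assumes P(floater | yell) = 1 and P(floater | orient) = 0.
    """
    p_floater = 1.0 if action == "yell" else 0.0
    return p_floater * utility(True, action) + (1 - p_floater) * utility(False, action)


def cdt_value(action: str) -> float:
    """CDT holds floater status fixed, since the action cannot cause it."""
    return P_FLOATER * utility(True, action) + (1 - P_FLOATER) * utility(False, action)


edt_choice = max(ACTIONS, key=edt_value)
cdt_choice = max(ACTIONS, key=cdt_value)
print("EDT recommends:", edt_choice)
print("CDT recommends:", cdt_choice)
```

The only difference between the two functions is which probability of being a floater gets plugged in, which is one way of seeing that they are scoring different questions.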

However, taking a look at the situation, it seems clear that the two theories are in some sense answering different questions. If I wanted to know whether or not it is preferable to be the kind of person who yells “Holy shit I’m a floater” then I should consult EDT for an answer. If instead I’m interested in what I should do in that situation, that doesn’t seem particularly relevant. I believe this should move us to consider the possibility that we haven’t asked a clear question when we ask what the right decision theory is. In the next section I will consider a variety of ways the problem we’re trying to solve can be precisified and note that they give rise to different decision theories.

Possible Precisifications

Ultimately, there is something a bit weird about asking what decision a real physical agent should take in a given situation. After all, the agent will act just as its software dictates and/or the laws of physics require. Thus, as Yudkowsky recognizes, any comparison of decision theories is asking some kind of counterfactual. However, which counterfactual we ask makes a huge difference in what decision theory is preferable. For instance, all of the following are potential ways to precisify the question of what it means for XDT to be a better decision theory than YDT.

  1. If there was a miracle that overrode the agent’s programming/physical laws at the moment of a choice then doing so in the manner prescribed by XDT yields better outcomes than doing so in a manner prescribed by YDT.
  2. In fact those actual agents who more often choose the outcome favored by XDT do better than those who choose the outcome favored by YDT.
  3. Those actual agents which adopt/apply XDT do better than those who adopt/apply YDT.
  4. Suppose there is a miracle that overrode physical laws at the moment the agent’s programming/internal makeup is specified then if the miracle results in outcomes more consistent with XDT than YDT the agent does better.
  5. As above except with applying XDT/YDT instead of just favoring outcomes which tend to agree with it.
  6. Moving one level up we could ask about which performs better, agents whose programming inclines them to adopt XDT or YDT when considered.
  7. Finally, if what we are interested in is actually coding agents, i.e., writing AI software, we might ask whether programmers who code their agents to reason in a manner that prefers choice A produce agents that do better than programmers who code agents to reason in a manner that prefers choice B.
  8. Pushing that one level up we could ask about whether programmers who are inclined to adopt/apply XDT/YDT as true produce agents which do better.

One could continue and list far more possibilities but these eight are enough to illustrate the point.

For instance, note that if we are asking question 1, CDT outperforms EDT. For the purposes of question 1 the right answer to the Newcomb problem is to be a two-boxer. After all, if we idealize the choice as a miracle that allows deviation from physical law, then the demon’s prediction of whether we would be a two-boxer or one-boxer no longer must be accurate, so two-boxing always outperforms one-boxing. It doesn’t matter that your software says you will choose only one box if we are asking about outcomes where a miracle occurs and overrides that software.

On the other hand it’s clearly true that EDT does better than CDT with respect to question 2. That’s essentially the definition of EDT.
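The contrast between questions 1 and 2 can be checked with a small simulation. This is only a sketch: the payoffs are the conventional Newcomb figures (the post states none), and the names `payoff` and `average_payoff` are made up for illustration. The demon predicts the agent's disposition; under question 2 the agent acts on that disposition, while under question 1 a miracle sets the action independently of it.

```python
import random

# Conventional Newcomb payoffs, assumed here because the post gives no figures.
BIG, SMALL = 1_000_000, 1_000


def payoff(predicted_one_box: bool, takes_one_box: bool) -> int:
    """The opaque box is filled only if the demon predicted one-boxing."""
    opaque_box = BIG if predicted_one_box else 0
    return opaque_box if takes_one_box else opaque_box + SMALL


def average_payoff(disposition_one_box: bool, action_one_box: bool,
                   accuracy: float = 1.0, trials: int = 10_000, seed: int = 0) -> float:
    """Average payoff when the demon predicts the disposition with the given
    accuracy and the agent then performs the given action."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        demon_is_right = rng.random() < accuracy
        predicted = disposition_one_box if demon_is_right else not disposition_one_box
        total += payoff(predicted, action_one_box)
    return total / trials


# Question 2: actual agents act on their dispositions.
print("one-boxers:", average_payoff(True, True))
print("two-boxers:", average_payoff(False, False))

# Question 1: a miracle overrides the action, the prediction stays put.
for disposition in (True, False):
    gain = average_payoff(disposition, False) - average_payoff(disposition, True)
    print("miraculous two-boxing gains", gain, "for disposition", disposition)
```

Nothing in the code settles which question is the right one to ask; it only shows that the ranking of one-boxing and two-boxing flips depending on whether the action is tied to the predicted disposition.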

To distinguish the remaining options we need to consider a range of different scenarios, such as demons who punish agents who actually apply/adopt XDT/YDT in reaching their conclusions. Or consider Newcombian demons who punish agents who adopt (or whose programmers adopted) one of XDT/YDT.

Ultimately, which criterion we should use to compare decision theories depends on what we want to achieve. Different idealizations/criteria will be appropriate depending on whether we are asking which rule we ourselves should adopt, how we should program agents to act, how we should program agents who program agents, etc. Moreover, I’d suggest that once we’ve fully precisified the kind of question we want to ask, the whole debate about which decision theory is best becomes irrelevant. Given a fully specified question we can just sit down and compute (or do empirical analysis), and when we can’t it indicates that we’ve failed to fully specify what we are asking.

The Use of Decision Theory By Agents

As a postscript I’d note that it’s also misguided to assume that the right way to program some kind of AI agent is to have that agent adopt some kind of decision theory like framework. Many discussions of decision theories seem to presume this by phrasing questions in terms of what decision theory should an AI apply/adopt. However, there is no reason to suppose that the way to produce the behavior favored by XDT is for the agent to actually believe/apply XDT. For instance, if a demon punishes agents who have adopted XDT then the outcomes XDT prefers might be best achieved by agents which explicitly eschew XDT. More pragmatically, it’s not at all clear that the most effective way for agents to reach XDT compatible outcomes is to perform the considerations demanded by XDT. That’s a good way to implement some algorithms but not all.

The reason that decision theory is useful in normal situations (i.e. lacking Omega/Newcombian demons) is that it’s a decent heuristic to assume that the way we internally consider outcomes/make choices doesn’t affect the payout we receive. Under this assumption pretty much all ways of precisifying the question give the same answer and it offers some good advice for programming agents. However, the usefulness of the framework once we abandon this assumption isn’t clear and can’t simply be assumed.

Thus, not only would I argue that the debate over which decision theory is best is misguided, but that we need to be more careful about the assumptions we make about applicability as well.


  1. For instance, any attempt to answer what makes one programming language better than another reveals substantial disagreement about which tradeoffs are desirable and no agreed upon framework for resolving them. Indeed, we in some sense all recognize that which programming language tradeoffs are desirable is context dependent. 
  2. Or in Yudkowsky’s formulation, Omega. 

Artificial Intelligence And The Structure Of Thought

Why Your Self-Driving Car Won't Cause Armageddon

In recent years a number of prominent individuals have raised concerns about our ability to control powerful AIs. The idea is that once we create truly human level generally intelligent software, or AGI, computers will undergo an intelligence explosion and will be able to escape any constraints we place on them. This concern has perhaps been most thoroughly developed by Eliezer Yudkowsky.

Unlike the AI in bad science fiction, the concern isn’t that the AI will be evil or desire dominion the way humans do but simply that it will be too good at whatever task we set it to perform. For instance, suppose Waymo builds an AI to run its fleet of self-driving cars. The AI’s task is to converse with passengers/app users and route its vehicles appropriately. Unlike more limited self-driving car software, this AI is programmed to learn the subtleties of human behavior so it can position a pool of cars in front of the stadium right before the game ends and helpfully show tourists the sights. On Yudkowsky’s vision the engineers achieve this by coding in a reward function that the software works to maximize (or equivalently a penalty function it works to minimize). For instance, in this case the AI might be punished based on negative reviews/frustrated customers, deaths/damage from accidents involving its vehicles, travel delays and customers who choose to use a competitor rather than Waymo. I’m already skeptical that (super) human AI would have anything identifiable as a global reward/utility function, but on Yudkowsky’s picture AGI is something like a universal optimizer which is set loose to do its best to achieve rewards.

The concern is that the AI would eventually realize that it could minimize its punishment by arranging for everyone to die in a global pandemic, since then there would be no bad reviews, lost customers or travel delays. Given the AI’s vast intelligence and massive data set it would then hack into microbiology labs and manipulate the workers there to create a civilization ending plague. Moreover, no matter what kind of firewalls or limitations we try to place on the AI, as long as it can somehow interact with the external world it will find a way around these barriers. Since it’s devilishly difficult to specify any utility function without such undesirable solutions, Yudkowsky concludes that AGI poses a serious threat to the human species.
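The worry about undesirable solutions can be seen in a toy version of the penalty function described above. The sketch below is an illustration only: the `FleetOutcome` fields mirror the penalties listed in the example, but the weights and the sample figures are invented, and no real system is claimed to work this way.

```python
from dataclasses import dataclass


@dataclass
class FleetOutcome:
    bad_reviews: int
    accident_damage_usd: float
    delay_minutes: float
    customers_lost: int


# Hypothetical weights; any positive values make the same point.
WEIGHTS = {
    "bad_reviews": 5.0,
    "accident_damage_usd": 0.01,
    "delay_minutes": 0.1,
    "customers_lost": 20.0,
}


def penalty(outcome: FleetOutcome) -> float:
    """Weighted sum of the things the fleet AI is punished for."""
    return (WEIGHTS["bad_reviews"] * outcome.bad_reviews
            + WEIGHTS["accident_damage_usd"] * outcome.accident_damage_usd
            + WEIGHTS["delay_minutes"] * outcome.delay_minutes
            + WEIGHTS["customers_lost"] * outcome.customers_lost)


# An ordinary good day of operation (made-up figures).
good_day = FleetOutcome(bad_reviews=3, accident_damage_usd=2000.0,
                        delay_minutes=450.0, customers_lost=1)

# A world with nobody left to ride, review, crash or defect.
nobody_left = FleetOutcome(bad_reviews=0, accident_damage_usd=0.0,
                           delay_minutes=0.0, customers_lost=0)

print("good day penalty:", penalty(good_day))
print("nobody left penalty:", penalty(nobody_left))
```

The function assigns its minimum to the outcome nobody wanted, which is the sense in which such specifications are hard to get right. Whether a real system would ever search for that minimum is a separate question.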

Rewards And Reflection

The essential mechanism at play in all of Yudkowsky’s apocalyptic scenarios is that the AI examines its own reward function, realizes that some radically different strategy would offer even greater rewards and proceeds to surreptitiously work to realize this alternate strategy. Now it’s only natural that a sufficiently advanced AI would have some degree of reflective access to its own design and internal deliberation. After all, it’s common for humans to reflect on our own goals and behaviors to help shape our future decisions, e.g., we might observe that if we continue to get bad grades we won’t get into the college we want and as a result decide that we need to stop playing World of Warcraft.

At first blush it might seem obvious that realizing its rewards are given by a certain function would induce an AI to maximize that function. One might even be tempted to claim this is somehow part of the definition of what it means for an agent to have a utility function, but that’s trading on an ambiguity between two notions of reward.

The sense of reward which gives rise to the worries about unintended satisfaction is that of positive reinforcement. It’s the digital equivalent of giving someone cocaine. Of course, if you administer cocaine to someone every time they write a blog post they will tend to write more blog posts. However, merely learning that cocaine causes a rewarding distribution of dopamine in the brain doesn’t cause people to go out and buy cocaine. Indeed, that knowledge could just as well have the exact opposite effect. Similarly, there is no reason to assume that merely because an AGI has a representation of its reward function it will try to reason out alternative ways to satisfy it. Indeed, indulging in anthropomorphizing for a moment, there is no reason to assume that an AGI will have any particular desire regarding rewards received by its future time states, much less adopt a particular discount rate.

Of course, in the long run, if a software program was rewarded for analyzing its own reward function and finding unusual ways to activate it then it could learn to do so, just as people who are rewarded with pleasurable drug experiences can learn to look for ways to short-circuit their reward system. However, if that behavior is punished, e.g., humans intervene and punish the software when it starts recommending public transit, then the system will learn to avoid short-circuiting its reward pathways just like people can learn to avoid addictive drugs. This isn’t to say that there is no danger here. Left alone, an AGI, just like a teen with access to cocaine, could easily learn harmful reward seeking behavior. However, since the system doesn’t start in a state in which it applies its vast intelligence to figure out ways to hack its reward function, the risk is far less severe.
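A toy reinforcement learner shows the mechanism in miniature. This is a deliberately crude sketch: the two action names, the reward values, the size of the human-imposed punishment and the round-robin training loop are all assumptions chosen for illustration, and nothing here is meant to model how an AGI would actually be trained.

```python
def learn_values(rewards: dict, rounds: int = 50, learning_rate: float = 0.2) -> dict:
    """Try every action each round and nudge its value estimate toward
    the reward it produced (a simple running-average update)."""
    values = {action: 0.0 for action in rewards}
    for _ in range(rounds):
        for action, reward in rewards.items():
            values[action] += learning_rate * (reward - values[action])
    return values


# Hypothetical raw rewards: short-circuiting the reward signal pays more.
RAW_REWARDS = {"route_normally": 1.0, "hack_reward": 10.0}

# Hypothetical punishment applied when humans catch the short-circuit.
OVERSIGHT_PENALTY = 20.0

left_alone = learn_values(RAW_REWARDS)
supervised = learn_values({
    **RAW_REWARDS,
    "hack_reward": RAW_REWARDS["hack_reward"] - OVERSIGHT_PENALTY,
})

print("left alone, prefers:", max(left_alone, key=left_alone.get))
print("supervised, prefers:", max(supervised, key=supervised.get))
```

The learner never inspects its reward function. It only ends up preferring whatever was reinforced, so which habit it acquires depends on whether the short-circuit was rewarded or punished along the way.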

Now, Yudkowsky might respond by saying he didn’t really mean the system’s reward function but its utility function. However, since we don’t tend to program machine learning algorithms by specifying the function they will ultimately maximize (or reflect on and try to maximize), it’s unclear why we need to explicitly specify a utility function that doesn’t lead to unintended consequences. After all, Yudkowsky is the one trying to argue that it’s likely that AGI will have these consequences, so merely restating the problem in a space that has no intrinsic relationship to how one would expect AGI to be constructed doesn’t do anything to advance his argument. For instance, I could point out that, phrased in terms of the locations of fundamental particles, it’s really hard to specify a program that excludes apocalyptic arrangements of matter, but that wouldn’t do anything to convince you that AIs risk causing such apocalypses since such specifications have nothing to do with how we expect an AI to be programmed.

The Human Comparison

Ultimately, we have one example of a kind of general intelligence: the human brain. Thus, when evaluating claims about the dangers of AGI, one of the first things we should do is see if the same story applies to our brain and, if not, whether there is any special reason to expect our brains to be different.

Looking at the way humans behave, it’s striking how poorly Yudkowsky’s stories describe our behavior even though evolution has shaped us in ways that make us far more dangerous than we should expect AGIs to be (we have self-preservation instincts, approximately coherent desires and beliefs, and are responsive to most aspects of the world rather than caring only about driving times or chess games). Time and time again we see that we follow heuristics and apply familiar mental strategies even when it’s clear that a different strategy would offer us greater activation of reward centers, greater reproductive opportunities or any other plausible thing we are trying to optimize.

The fact that we don’t consciously try and optimize our reproductive success and instead apply a forest of frameworks and heuristics that we follow even when they undermine our reproductive success strongly suggests that an AGI will most likely function in a similar heuristic layered fashion. In other words, we shouldn’t expect intelligence to come as a result of some pure mathematical optimization but more as a layered cake of heuristic processes. Thus, when an AI responsible for routing cars reflects on its performance it won’t see the pure mathematical question of how can I minimize such and such function any more than we see the pure mathematical question of how can I cause dopamine to be released in this part of my brain or how can I have more offspring. Rather, just as we break up the world into tasks like ‘make friends’ or ‘get respect from peers’ the AI will reflect on the world represented in terms of pieces like ‘route car from A to B’ or ‘minimize congestion in area D’ that bias it towards a certain kind of solution and away from plots like avoid congestion by creating a killer plague.

This isn’t to say there aren’t concerns. Indeed, as I’ve remarked elsewhere, I’m much more concerned about schizophrenic AIs than I am about misaligned AIs, but that’s enough for this post.

AI Bias and Subtle Discrimination

Don't Incentivize Discrimination To Feel Better

This is an important point not just about AI software but discussions about race and gender more generally. Accurately reporting (or predicting) facts that, all too often, are the unfortunate result of a long history of oppression or simple random variation isn’t bias.

Personally, I feel that the social norm which regards accurate observation of facts such as (as mentioned in the article) racial differences in loan repayment rate conditional on wealth to be a reflection of bias is just a way of pretending society’s social warts don’t exist. Only by accurately reporting such effects can we hope to identify and rectify the causes, e.g., perhaps differences in treatment make employment less stable for certain racial groups, or perhaps whether or not the bank officer looks like you affects likelihood of repayment. Our unwillingness to confront these issues places our personal interest in avoiding the risk of seeming racist/sexist over the social good of working out and addressing the causes of these differences.

Ultimately, the society I want isn’t the wink and a nod culture in which people all mouth platitudes but we implicitly reward people for denying underrepresented groups loans or spots in colleges or whatever. I think we end up with a better society (not the best, see below) when the bank’s loan evaluation software spits out a number which bakes in all available correlations (even the racial ones) and rewards the loan officer for making good judgements of character independent of race, rather than the system where the software can’t consider that factor and we reward the loan officers who evaluate the character of applicants of color more negatively to compensate, or the bank executives who choose not to place branches in communities of color, and so on. Not only does this encourage a kind of wink and nod racism, but when banks optimize profits via subtle discrimination rather than explicit consideration of the numbers one ends up creating a far higher barrier to minorities getting loans than a slight tick up in predicted default rate. If we don’t want to use features like the applicant’s race in decisions like loan offers, college acceptance etc., we need to affirmatively acknowledge these correlations exist and ensure we don’t implement incentives to be subtly racist, e.g., evaluate loan officers’ performance relative to the (all factors included) default rate so we don’t implicitly reward loan officers and bank managers with biases against people of color (which itself imposes a barrier to minority loan officers).
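The incentive point can be illustrated with a small comparison of two ways to score a loan officer. The sketch is hypothetical throughout: the two officers, their loan books and the predicted default probabilities are invented, and the functions `raw_default_rate` and `default_rate_vs_prediction` are illustrative names, not anything a real bank is known to use. Each loan is a pair of the model's predicted default probability and whether the loan actually defaulted.

```python
from statistics import mean


def raw_default_rate(loans):
    """Fraction of approved loans that defaulted."""
    return mean(defaulted for _, defaulted in loans)


def default_rate_vs_prediction(loans):
    """Actual default rate minus the rate the model predicted for the same
    loans. Negative values mean the officer beat the prediction."""
    predicted = mean(probability for probability, _ in loans)
    return raw_default_rate(loans) - predicted


# Officer A approves only applicants the model already rates as very safe
# (made-up data: 20 loans at 5% predicted risk, 1 default).
officer_a = [(0.05, 0)] * 19 + [(0.05, 1)]

# Officer B approves applicants the model rates as riskier and picks well
# among them (made-up data: 20 loans at 20% predicted risk, 2 defaults).
officer_b = [(0.20, 0)] * 18 + [(0.20, 1)] * 2

for name, loans in (("A", officer_a), ("B", officer_b)):
    print(name, "raw:", raw_default_rate(loans),
          "vs prediction:", default_rate_vs_prediction(loans))
```

On the raw measure officer A looks better simply for avoiding applicants the model scores as riskier, while the relative measure credits officer B for judgement that adds something beyond the model's number. Which measure an employer rewards determines which behavior it encourages.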

In short, don’t let the shareholders and executives get away with passing the moral buck by saying ‘Ohh no, we don’t want to consider factors like race when offering loans’ but then turning around and using total profits as the incentive to ensure their employees do the discrimination for them. It may feel uncomfortable openly acknowledging such correlations, but doing so is necessary to trace out the social causes of these ills. The other option is continued incentives for covert racism, especially the use of subtle social cues of being the ‘right sort’ to identify likely success, and that is what perpetuates the cycle.

 

A.I. ‘Bias’ Doesn’t Mean What Journalists Say it Means

In Florida, a criminal sentencing algorithm called COMPAS looks at many pieces of data about a criminal and computes the probability that they will commit new crimes. Judges use these risk scores in criminal sentencing and parole hearings to determine whether the offender should be kept in jail or released.