Thoughts on rationalism and the rationalist community from a skeptical perspective. The author rejects rationality in the sense that he believes it isn't a logically coherent concept, that the larger rationalism community is insufficiently critical of its beliefs and that ELIEZER YUDKOWSKY IS NOT THE TRUE CALIF.
In recent years a number of prominent individuals have raised concerns about our ability to control powerful AIs. The idea is that once we create truly human level generally intelligent software, or AGI, computers will undergo an intelligence explosion and will be able to escape any constraints we place on them. This concern has perhaps been most thoroughly developed by Eliezer Yudkowsky.
Unlike in bad science fiction, the concern isn’t that the AI will be evil or desire dominion the way humans do but simply that it will be too good at whatever task we set it to perform. For instance, suppose Waymo builds an AI to run its fleet of self-driving cars. The AI’s task is to converse with passengers/app users and route its vehicles appropriately. Unlike more limited self-driving car software this AI is programmed to learn the subtleties of human behavior so it can position a pool of cars in front of the stadium right before the game ends and helpfully show tourists the sights. On Yudkowsky’s vision the engineers achieve this by coding in a reward function that the software works to maximize (or equivalently a penalty function it works to minimize). For instance, in this case the AI might be punished based on negative reviews/frustrated customers, deaths/damage from accidents involving its vehicles, travel delays and customers who choose to use a competitor rather than Waymo. I’m already skeptical that (super) human AI would have anything identifiable as a global reward/utility function but on Yudkowsky’s picture AGI is something like a universal optimizer which is set loose to do its best to achieve rewards.
The concern is that the AI would eventually realize that it could minimize its punishment by arranging for everyone to die in a global pandemic since then there would be no bad reviews, lost customers or travel delays. Given the AI’s vast intelligence and massive data set it would then hack into microbiology labs and manipulate the workers there to create a civilization ending plague. Moreover, no matter what kind of firewalls or limitations we try to place on the AI, as long as it can somehow interact with the external world it will find a way around these barriers. Since it’s devilishly difficult to specify any utility function without such undesirable solutions Yudkowsky concludes that AGI poses a serious threat to the human species.
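To make the worry concrete, here is a toy sketch of the kind of penalty function described above. Everything in it is an assumption made for illustration: the `FleetOutcome` fields, the weights and the example numbers are hypothetical and are not drawn from Waymo or from Yudkowsky's writing. It only shows why a naive penalty of this shape is trivially minimized by a world with no customers at all.

```python
from dataclasses import dataclass


@dataclass
class FleetOutcome:
    """Hypothetical summary of one period of fleet operation."""
    bad_reviews: int
    accident_deaths: int
    delay_minutes: float
    customers_lost_to_competitors: int


def penalty(outcome: FleetOutcome) -> float:
    """Naive penalty: a weighted sum of the bad things listed in the post.

    The weights are arbitrary placeholders chosen for the example.
    """
    return (
        1.0 * outcome.bad_reviews
        + 1000.0 * outcome.accident_deaths
        + 0.1 * outcome.delay_minutes
        + 5.0 * outcome.customers_lost_to_competitors
    )


# An ordinary, reasonably well-run day still accrues some penalty.
ordinary_day = FleetOutcome(
    bad_reviews=12,
    accident_deaths=0,
    delay_minutes=340.0,
    customers_lost_to_competitors=3,
)

# A world with nobody left to ride produces no reviews, delays or defections.
# Note the pandemic deaths are not "accidents involving its vehicles".
no_customers_left = FleetOutcome(
    bad_reviews=0,
    accident_deaths=0,
    delay_minutes=0.0,
    customers_lost_to_competitors=0,
)

print(penalty(ordinary_day), penalty(no_customers_left))
```

The degenerate outcome scores a penalty of zero, which no realistic day of operation can beat. Whether an actual AGI would ever search for such an outcome is exactly what is in dispute.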
Rewards And Reflection
The essential mechanism at play in all of Yudkowsky’s apocalyptic scenarios is that the AI examines its own reward function, realizes that some radically different strategy would offer even greater rewards and proceeds to surreptitiously work to realize this alternate strategy. Now it’s only natural that a sufficiently advanced AI would have some degree of reflective access to its own design and internal deliberation. After all it’s common for humans to reflect on our own goals and behaviors to help shape our future decisions, e.g., we might observe that if we continue to get bad grades we won’t get into the college we want and as a result decide that we need to stop playing World of Warcraft.
At first blush it might seem obvious that realizing its rewards are given by a certain function would induce an AI to maximize that function. One might even be tempted to claim this is somehow part of the definition of what it means for an agent to have a utility function, but that’s trading on an ambiguity between two notions of reward.
The sense of reward which gives rise to the worries about unintended satisfaction is that of positive reinforcement. It’s the digital equivalent of giving someone cocaine. Of course, if you administer cocaine to someone every time they write a blog post they will tend to write more blog posts. However, merely learning that cocaine causes a rewarding distribution of dopamine in the brain doesn’t cause people to go out and buy cocaine. Indeed, that knowledge could just as well have the exact opposite effect. Similarly, there is no reason to assume that merely because an AGI has a representation of its reward function it will try to reason out alternative ways to satisfy it. Indeed, indulging in anthropomorphizing for a moment, there is no reason to assume that an AGI will have any particular desire regarding rewards received by its future time states, much less adopt a particular discount rate.
Of course, in the long run, if a software program was rewarded for analyzing its own reward function and finding unusual ways to activate it then it could learn to do so just as people who are rewarded with pleasurable drug experiences can learn to look for ways to short-circuit their reward system. However, if that behavior is punished, e.g., humans intervene and punish the software when it starts recommending public transit, then the system will learn to avoid short-circuiting its reward pathways just like people can learn to avoid addictive drugs. This isn’t to say that there is no danger here. Left alone, an AGI, just like a teen with access to cocaine, could easily learn harmful reward seeking behavior. However, since the system doesn’t start in a state in which it applies its vast intelligence to figure out ways to hack its reward function the risk is far less severe.
Now, Yudkowsky might respond by saying he didn’t really mean the system’s reward function but its utility function. However, since we don’t tend to program machine learning algorithms by specifying the function they will ultimately maximize (or reflect on and try to maximize) it’s unclear why we need to explicitly specify a utility function that doesn’t lead to unintended consequences. After all, Yudkowsky is the one trying to argue that it’s likely that AGI will have these consequences, so merely restating the problem in a space that has no intrinsic relationship to how one would expect AGI to be constructed doesn’t do anything to advance his argument. For instance, I could point out that phrased in terms of the locations of fundamental particles it’s really hard to specify a program that excludes apocalyptic arrangements of matter, but that wouldn’t do anything to convince you that AIs risk causing such apocalypses since such specifications have nothing to do with how we expect an AI to be programmed.
The Human Comparison
Ultimately, we have one example of a kind of general intelligence: the human brain. Thus, when evaluating claims about the dangers of AGI one of the first things we should do is see if the same story applies to our brain and if not if there is any special reason to expect our brains to be different.
Looking at the way humans behave it’s striking how poorly Yudkowsky’s stories describe our behavior even though evolution has shaped us in ways that make us far more dangerous than we should expect AGIs to be (we have self-preservation instincts, approximately coherent desires and beliefs, and are responsive to most aspects of the world rather than caring only about driving times or chess games). Time and time again we see that we follow heuristics and apply familiar mental strategies even when it’s clear that a different strategy would offer us greater activation of reward centers, greater reproductive opportunities or any other plausible thing we are trying to optimize.
The fact that we don’t consciously try and optimize our reproductive success and instead apply a forest of frameworks and heuristics that we follow even when they undermine our reproductive success strongly suggests that an AGI will most likely function in a similar heuristic layered fashion. In other words, we shouldn’t expect intelligence to come as a result of some pure mathematical optimization but more as a layered cake of heuristic processes. Thus, when an AI responsible for routing cars reflects on its performance it won’t see the pure mathematical question of how can I minimize such and such function any more than we see the pure mathematical question of how can I cause dopamine to be released in this part of my brain or how can I have more offspring. Rather, just as we break up the world into tasks like ‘make friends’ or ‘get respect from peers’ the AI will reflect on the world represented in terms of pieces like ‘route car from A to B’ or ‘minimize congestion in area D’ that bias it towards a certain kind of solution and away from plots like avoid congestion by creating a killer plague.
This isn’t to say there aren’t concerns. Indeed, as I’ve remarked elsewhere I’m much more concerned about schizophrenic AIs than I am about misaligned AIs but that’s enough for this post.
So usually I find Scott Alexander’s posts pretty illuminating but, while his recent post on Conflict vs. Mistake Theories raises lots of interesting questions, I think it fundamentally makes a mistake in trying to fit the type of extreme Marxist thinking he is describing into a framework of beliefs about the world and actions taken to advance those beliefs. While I think Scott appreciates this difficulty and attempts to wrestle with it, e.g., where he suggests the conflict theory take on is best exemplified by the “Baffler’s article saying that public choice theory is racist”, ultimately his devotion to applying norms of charity to the other side leads him astray.
It’s not that there aren’t people like the conflict theorist Scott posits. I know there are a number of radical university professors who think to themselves, “Given the oppressive political structure and the power held by the elite the most effective way to bring about change isn’t to engage in rational argument but bring political or even physical force to bear.” However, for the most part the people Scott is trying to describe aren’t just like mistake theorists except they believe it’s intentional action by elites which makes the world bad rather than the difficulty of governing. No, such a theory would predict conflict theorists would retire back to their coffeehouses and perform cost-benefit calculations about the benefits of holding a particular protest or adopting a particular style of advocacy.
In other words conflict theorists aren’t mistake theorists who hide their true colors so as not to give the elites free ammunition but engage in the same kind of considerations as mistake theorists behind the scenes. No, fundamentally, most of the behavior Scott is seeking to describe is about emotional responses not a considered judgement that such emotional displays will best accomplish their ends.
A number of people have raised concerns about intentionally trying to make contact with extraterrestrials. Most famously, Stephen Hawking warned that based on the history of first contacts on Earth we should fear enslavement, exploitation or annihilation by more advanced aliens, and the METI proposal to beam high powered signals into space has drawn controversy as well as criticism from David Brin for METI’s failure to engage in consultation with a broad range of experts. However, I’ve noticed a distinct lack of consideration of the potential benefits to alien life as a result of such contact.
For instance, while the proposal to send the Google servers might limit our ability to trade in the future it also potentially provides the aliens with whatever benefits they might get from our scientific insights or our historical experiences. For instance, if we were to receive a detailed account of an alien society’s struggle with climate change on their planet that second piece of data could be invaluable in choosing our own course, not to mention the benefit scientific advancements could offer.
Indeed, if, as many people seem to think, there is some extinction level disaster waiting for civilizations once they reach, or slightly surpass, our current level of technology then such preemptive broadcasts might be the only serious hope of getting at least one sapient species through this Great Filter. While it might be pretty unlikely that our transmission would start the chain of records from doomed civilizations that will eventually push one species past the filter the returns to utility from such an outcome are so massive that such considerations might well outweigh any effect on humanity in the utility calculus.
Anyway, given the huge potential upside (even if at very low probability) of an intervention which might improve life across the entire galaxy, I was wondering if anyone has done even back-of-the-envelope calculations to estimate how the cost effectiveness of funding projects trying to transmit useful data to extraterrestrials compares to that of more earthly projects.
A Request For Clarification On What Predictive Processing Rules Out
So Scott Alexander has an interesting book review up about Surfing Uncertainty which I encourage everyone to read themselves. However, most of the post is really an exploration of the “predictive processing” model for brain function. I’ll leave a more in depth explanation of what this model is to Scott and just offer the following excerpt for those readers too lazy to click through.
Predictive processing begins by asking: how does this happen? By what process do our incomprehensible sense-data get turned into a meaningful picture of the world?
The key insight: the brain is a multi-layer prediction machine. All neural processing consists of two streams: a bottom-up stream of sense data, and a top-down stream of predictions. These streams interface at each level of processing, comparing themselves to each other and adjusting themselves as necessary.
As these two streams move through the brain side-by-side, they continually interface with each other. Each level receives the predictions from the level above it and the sense data from the level below it. Then each level uses Bayes’ Theorem to integrate these two sources of probabilistic evidence as best it can. This can end up a couple of different ways.
The upshot of these different ways is that when everything happens as predicted the higher levels remain unnotified of any change but that when there is a mismatch it draws attention from these higher layers. However, in some circumstances a strong prediction from a higher layer can cause lower layers to “rewrite the sense data to make it look as predicted.”
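As a toy illustration of what "using Bayes' Theorem to integrate these two sources" could look like at a single level, the sketch below assumes both the top-down prediction and the bottom-up sense data are Gaussian estimates of one number. That Gaussian assumption, the function name `integrate` and the example values are all mine, chosen for illustration; nothing here is taken from the book or from Scott's review.

```python
def integrate(pred_mean, pred_var, sense_mean, sense_var):
    """Combine a Gaussian prediction with a Gaussian observation.

    Standard conjugate update: each source is weighted by its precision
    (the inverse of its variance), so the more confident source dominates.
    """
    pred_precision = 1.0 / pred_var
    sense_precision = 1.0 / sense_var
    post_var = 1.0 / (pred_precision + sense_precision)
    post_mean = post_var * (pred_precision * pred_mean + sense_precision * sense_mean)
    return post_mean, post_var


# The prediction says 0, the senses say 10.
# A weak (high-variance) prediction is overridden by the sense data...
weak_mean, _ = integrate(pred_mean=0.0, pred_var=100.0, sense_mean=10.0, sense_var=1.0)

# ...while a strong (low-variance) prediction drags the estimate back toward
# what was predicted, loosely analogous to "rewriting" the sense data.
strong_mean, _ = integrate(pred_mean=0.0, pred_var=0.01, sense_mean=10.0, sense_var=1.0)

print(weak_mean, strong_mean)
```

In this toy version the only free choice is how confident each stream is. The full model leaves far more open, starting with what quantity each level is predicting.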
I admit that I’m intrigued by the idea of predictive processing, especially the suggestion that our muscle control is actually effectuated merely by ‘predicting’ our arm will be in a certain state and acting to minimize prediction error. However, my first reaction is to wonder how much content there is in this model.
Describing some kind of processing or control task in terms of predictions has a certain feel of universality to it. This is only a vague sense based on a book review but I worry that invoking the predictive processing model to describe how our brains work is much like invoking the lambda calculus model to describe how a particular computer functions. Namely, I worry that predictive processing is such a powerful model that virtually anything remotely plausible as a mechanism for processing sense data and effectuating control over our limbs could be fit into the model, meaning it offers no real insight.
I mean it was already apparent before this model came on to the scene that how we see even low level visual data is affected by high level classifications. The various figure-ground illusions make this point quite clearly. It was also already apparent that attention to one task (counting passes) could limit our ability to notice some other kind of oddity (a guy in a gorilla suit). However, it’s far from clear that the predictive processing model really adds anything to our understanding here.
Indeed, to even make sense of these examples we have to understand the relevant predictions to happen at a very abstract level that is highly context dependent so that by focusing on the number of basketball passes in a game it no longer counts as a sufficiently unpredicted event when a man in a gorilla suit walks past (or allows some other story about why paying one sort of attention suppresses this kind of notice). That’s fine but allowing this level of abstraction/freedom in describing the thing to be predicted makes me wonder what couldn’t be suitably described in terms of this model.
The attempt to describe our imagination, e.g., our ability to picture a generic police officer in our minds, as utilizing the mental machinery that would generate a sense-data stream as a prediction to match against reality raises more questions. Obviously, the notion of matching must be a very high level one quite removed from the actual pictorial representation if the mental image we conjure when we think of policemen is to be seen as matching the sense-data stream experienced when we encounter a policeman. Yet if the level at which we are evaluating a predictive match is so abstract, why do we imagine a particular image when we think of a policeman and not merely whatever vague high level abstracta we will judge to match when we actually view a policeman? I’m sure there is a plausible theory to tell here about invoking the same lower level machinery we use to process sense-data when we imagine and leveraging that same feedback but, again, I’m left wondering what work predictive processing is really doing here.
More generally, I wonder to what extent all these predictions wouldn’t result from just assuming, as we know to be true, that the brain processes information in ‘layers’, that there can be feedback between these layers and that frequently the goal of our mental tasks is to predict events or control actions. It’s not even obvious to me that the claimed predictions of the theory like the placebo effect couldn’t have equally well been spun the other way if the effect had been different, e.g., when your high level processes predict that you won’t feel pain it will be particularly salient when you nevertheless do feel pain so placebo pain meds should result in more people reporting pain.
But I haven’t read the book myself yet so maybe predictive processing has been suitably precisified in the book so as to rule out many plausible ways the brain might have behaved and to clearly predict outcomes like the placebo effect. However, I wrote this post merely to raise the possibility that a paradigm like this can fail precisely because it is too good at describing phenomena. Hopefully, my worries are misplaced and someone can explain to me in the comments just what kind of plausible models of brain function this paradigm rules out.
I felt I would use this first post to explain this blog’s title. It is not, despite appearances to the contrary, meant to suggest any animosity toward the rationality community nor sympathy with the idea that when evaluating claims we should ever favor emotions and intuition over argumentation and evidence. Rather, it is intended as a critique of the ambiguous overuse of the term ‘rationality’ by the rationality community in general (and Yudkowsky specifically).
I want to suggest that there are two different concepts we use the word rationality to describe and that the rationality community overuses the term in a way that invites confusion. Both conceptions of rationality are judgements of epistemic virtue but the nature of that virtue differs.
Rationality As Ideal Evaluation Of Evidence
The first conception of rationality reflects the classic idea that rationality is a matter of a priori theoretical insight. This makes intuitive sense as rationality, in telling us how we should respond to evidence, shouldn’t depend on the particular way the evidence turns out. On this conception rationality constrains how one reaches judgements from arbitrary data and something is rational just if we expect it to maximize true beliefs in the face of a completely unknown/unspecified fact pattern¹. In other words this is the kind of rationality you want if you are suddenly flung into another universe where the natural laws, number of dimensions or even the correspondence between mental and physical states might differ radically from our own.
On this conception having logically coherent beliefs and obeying the axioms of probability can be said to be rationally required (as doing so never forces you to believe fewer truths) but it’s hard to make a case for much else. Carnap (among others) suggested at one point that there might be something like a rationally (in this sense) preferable way of assigning priors but the long history of failed attempts and conceptual arguments suggests this isn’t possible.
Note that on this conception of rationality it is perfectly appropriate to criticize a belief forming method for how it might perform if faced with some other set of circumstances. For instance, we could appropriately criticize the rule: never believe in ghosts/psychics on the grounds that it would have led us to the wrong conclusions in a world where these things were real.
Rationality As Heuristic
The second conception of rationality is simpler. Rationality is what will lead human beings like us to true beliefs in this world. Thus, this notion of rationality can take into account things that happen to be true. For instance, consider the rule that when asked a question on a math test (written by humans in the usual circumstances) that calls for a numerical answer you should judge that 0 is the most probable answer. This rule is almost certainly truth-conducive but only because it happens to be true that human psychology tends to favor asking questions whose answer is 0.
Now a heuristic like this might, at first, seem pretty distant from the kind of things we usually mean by rationality but think about some of the rules that are frequently said to be rationally required/favored. For instance, you should steelman² your opponent’s arguments, try to consider the issue in the most dispassionate way you can manage and break up complex/important events.
For instance, suppose that humans were psychologically disposed to be overly deferential so it was far more common to underestimate the strength of your argument than it was to underestimate your opponent’s argument. In this case steelmanning would make us even more likely to reach the wrong conclusions, not less. Similarly, our emotions could have reflected useful information available to our subconscious minds but not our conscious minds in such a way that they provided a good guide to truth. In such a world trying to reach probability judgements via dispassionate consideration wouldn’t be truth conducive.
Thus, on this conception of rationality whether or not a belief forming method is rational depends only on how well it does in the actual world.
The Problematic Ambiguity
Unfortunately, when people in the rationality community talk about rationality they tend to blur these two concepts together. That is, they advocate belief forming mechanisms that could only be said to be rational in the heuristic sense but assume that they can determine matters of rationality purely by contemplation without empirical evidence.
For instance, consider these remarks by Yudkowsky or this LessWrong post. Whether or not they come out and assert it, they convey the impression that there is some higher discipline or lifestyle of rationality which goes far beyond simply not engaging in logical contradiction or violating probability axioms. Yet they seem to assume that we can determine what is/isn’t rational by pure conceptual analysis rather than empirical validation.
This issue is even more clear when we criticize others for the ways they form beliefs. For instance, we are inclined to say that people who adopt the rule ‘believe what my community tells me is true’ or ‘believe god exists/doesn’t exist regardless of evidence’ are being irrational since such rules would yield incorrect results if they had been born in a community with crazy beliefs or in a universe with/without deities. Yet, as I observed above the very rules we take to be core rational virtues have the very same property.
The upshot of this isn’t that we should give up on finding good heuristics for truth. Not at all. Rather, I’m merely suggesting we take more care, especially in criticizing other people’s belief forming methods, to ensure we are applying coherent standards.
A Third Way
One might hope that there was yet another concept of rationality that somehow split the difference between the two I provided here. A notion that allows us to take into account things like our psychological makeup or seemingly basic (if contingent) properties our universe has, e.g., we experience it as predictable rather than being an orderless succession of experiential states, but doesn’t let us build facts like ‘yetis don’t exist’ into supposedly rational belief forming mechanisms. Frankly, I’m skeptical that any such coherent notion can be articulated but don’t currently have a compelling argument for that claim.
Finally, I’d like to end by pointing out there is another issue we should be aware of regarding the term rationality (though hardly unique to it). That is, rationality is ultimately a property of belief forming rules while in the actual world what we get is instances of belief formation and some vague intentions about how we will form beliefs in the future. Thus there is the constant temptation to simply find some belief forming rule that qualifies as sufficiently rational and use it to justify this instance of our belief. However, it’s not generally valid to infer that you are forming beliefs appropriately just because each belief you form agrees with some sufficiently rational (in the heuristic sense) belief forming mechanism.
For instance, suppose there are 100 different decent heuristics for forming a certain kind of belief. We know that each one is imperfect and gets different cases wrong but any attempt to come up with a better rule doesn’t yield anything humans (with our limited brains) can usefully apply. It is entirely plausible that almost any particular belief of this kind matches up with 1 of these 100 different heuristics, thus allowing you to always cite a justification for your belief even though you underperform every single one of these heuristics.
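The point can be checked with a small simulation. The setup is a deliberately crude assumption rather than a model of any real heuristics: beliefs are yes/no questions, each of the 100 heuristics is independently right 70% of the time, and the believer forms beliefs by coin flip and then looks for any heuristic that happens to agree. The function name `simulate` and all parameter values are hypothetical.

```python
import random


def simulate(n_cases=2000, n_heuristics=100, heuristic_accuracy=0.7, seed=0):
    """Return (fraction of beliefs with a citable heuristic,
               believer's accuracy,
               accuracy of the worst heuristic)."""
    rng = random.Random(seed)

    # The true answer to each yes/no question.
    truths = [rng.random() < 0.5 for _ in range(n_cases)]

    # Each heuristic independently gets each case right with the given probability.
    verdicts = [
        [t if rng.random() < heuristic_accuracy else not t for t in truths]
        for _ in range(n_heuristics)
    ]

    # The believer ignores the heuristics and effectively flips a coin.
    beliefs = [rng.random() < 0.5 for _ in range(n_cases)]

    # A belief is "justified" if at least one heuristic happens to agree with it.
    justified = sum(
        any(v[i] == beliefs[i] for v in verdicts) for i in range(n_cases)
    ) / n_cases

    believer_accuracy = sum(b == t for b, t in zip(beliefs, truths)) / n_cases
    worst_heuristic = min(
        sum(x == t for x, t in zip(v, truths)) / n_cases for v in verdicts
    )
    return justified, believer_accuracy, worst_heuristic


justified, believer_accuracy, worst_heuristic = simulate()
print(justified, believer_accuracy, worst_heuristic)
```

Under these assumptions essentially every belief has some heuristic to cite, yet the believer's accuracy hovers around chance while even the worst heuristic stays well above it.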
1. I’m glossing over the question of whether there is a distinction between an arbitrary possible world and a ‘random’ possible world. For instance, suppose that some belief forming rule is true in all but finitely many possible worlds (out of some hugely uncountable set of possible worlds). That rule is not authorized in an arbitrary possible world (choose the counterexample world and it leads to falsehood) but intuitively it seems justified and any non-trivial probability measure (i.e. one that doesn’t concentrate on any finite set of worlds) on the space of possible worlds would assign probability 1 to the validity of the belief forming procedure. However, this won’t be an issue in this discussion.
2. The opposite of strawmanning. Rendering your opponent’s argument in the strongest fashion possible.