Paul the Octopus, relevance and the joy of superstition

So, as you all know, Spain beat the Netherlands and won the football World Cup in Johannesburg on July 11, 2010. As most of you may also know, this victory was predicted by a German octopus named Paul. Before the match, Paul was presented with two transparent boxes, each baited with mussel flesh, one decorated with the Spanish flag and the other with the Dutch flag, and, yes, Paul the octopus correctly chose the Spanish flag box. One chance out of two, you might sneer, but Paul had already correctly predicted, by the same method, the results of the seven matches in which the German team played. The probability of achieving such a perfect series of eight predictions by chance is 1/2^8 = 1/256, or about 0.003906. More impressive, no? Paul the Octopus is now a TV news star: he has today more than 200,000 Google entries and more than 170,000 Facebook friends; he has received both death threats and commercial offers, and so on. On July 12, Paul’s owners presented him with a replica World Cup trophy and announced that “he won’t give any more oracle predictions – either in football, or in politics, lifestyle or economy.”

Should you be impressed?

By Paul’s performance, not really. To begin with, as argued here, Paul may have been biased to prefer some flag patterns better exemplified by the German flag and the Spanish flag than by, say, the Union Jack or the Dutch flag. Much more importantly, many other animals around the world, for instance rabbits, hippopotamuses, and parakeets, also made predictions regarding the World Cup matches. Those that failed the first time (it must have been about half of them) disappointed their masters and did not make the news, nor did those that failed on the second or the third trial. By then, about one eighth of the animal oracles were still in business and might each have gained some local fame. The fallacy involved (known as the ‘prosecutor’s fallacy’*) is now easy to discern: if Paul had been selected at random among hundreds of candidates to be the unique World Cup animal oracle, his perfect performance would be remarkable. What happened, however, is that he had been selected as one of the Cup’s informal oracles because his performance had been perfect for five or six rounds. The chance that it would remain perfect for another three or two rounds was by then a far less impressive 1/8 or 1/4. When this indeed occurred, he stood alone in what looked like preternatural oracular talent.
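
The selection effect described above can be checked with a little arithmetic and a toy simulation. The Python sketch below is only illustrative: the size of the field (256 animal oracles) is an assumption, as is the idea that each oracle picks a winner by a fair coin flip, and the function names are invented for the example.

```python
import random


def chance_of_perfect_run(n_matches: int) -> float:
    """Probability that one coin-flipping oracle calls n_matches in a row."""
    return 0.5 ** n_matches


def chance_some_oracle_is_perfect(n_oracles: int, n_matches: int) -> float:
    """Probability that at least one of n_oracles independent guessers is perfect."""
    return 1.0 - (1.0 - chance_of_perfect_run(n_matches)) ** n_oracles


def simulate_survivors(n_oracles: int, n_matches: int, rng: random.Random) -> list[int]:
    """Count the oracles that are still unbeaten after each match."""
    unbeaten = n_oracles
    history = [unbeaten]
    for _ in range(n_matches):
        # Each unbeaten oracle guesses the next result correctly with probability 1/2.
        unbeaten = sum(1 for _ in range(unbeaten) if rng.random() < 0.5)
        history.append(unbeaten)
    return history


if __name__ == "__main__":
    print(chance_of_perfect_run(8))               # Paul's eight calls: 1/256
    print(chance_of_perfect_run(3))               # three more calls after five: 1/8
    print(chance_some_oracle_is_perfect(256, 8))  # assumed field of 256 oracles
    print(simulate_survivors(256, 8, random.Random(2010)))
```

Under these assumptions, the chance that at least one of the 256 hypothetical oracles finishes with a perfect record is roughly 63%, even though each individual oracle has only a 1/256 chance of doing so.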

There are good reasons, however, from a cognition-and-culture point of view, to be impressed with the interest generated by Paul’s ‘predictions’. The fallacy involved may arise in ordinary cognition and even more in cultural transmission. We are plausibly geared to pay attention to events that present a pattern that goes against our expectations. Eight flips of a coin resulting in eight tails grab our attention more than, say, a tail-tail-head-tail-head-head-head-tail series. Even though both series are equally improbable, only the first one has a noticeable pattern that causes us to pay attention to its improbability. We are geared to assume that regular patterns have a relevant explanation (a relevant explanation is one that explains a lot with relatively few assumptions, a good effort-effect ratio). The less we are able to explain a regular pattern, the more we are likely, ceteris paribus, to see its occurrence as remarkably improbable and its explanation as relevant. (Incidentally, this is far from being an absurd assumption.) Strikingly improbable regular patterns also provide us with a relevant topic of conversation. Hence, through communication, we are likely to pay much more attention to such patterns than we would if we were solitary observers of the world’s regularities, and to overestimate their frequency. Hence the prosecutor’s fallacy is reinforced through communication (or, arguably, originates in communication). As a result, a variety of ‘superstitions’ seem to benefit from rich, culturally transmitted evidence.

Yes, but in the case of Paul the octopus, journalists who talked about his predictions did point out that they resulted from chance, and few of the people who spread the information – as I am doing right now – mistook it for a case of true predictions. So the idea of fallacy-based, apparently relevant, false beliefs falls short of really providing us with a good explanation of Paul’s fame. Still, the proper explanation, I would suggest, is not very far, and it is quite interesting in its own right. The mental mechanisms that cause us to pay more attention to information with greater expected relevance and to communicate it more readily are, as Deirdre Wilson and I have argued (see here and here), not really representing the relevance of information and even less calculating it: they are, rather, sensitive to rough features that correlate well enough with relevance to make it adaptive to allocate mental resources, e.g. attention, on the basis of these features. So, information that looks relevant by these rough criteria is enough to trigger the processes and the mental elation that goes with expectations of relevance. It falls, if I may use the jargon, in the ‘actual domain’ of relevance detectors, whether or not it falls in their ‘proper domain’, just as pornography may cause arousal even though it falls outside of the proper domain of potential sexual partner detection.

We may get pleasure from having our expectations of relevance aroused. We often indulge in this pleasure for its own sake rather than for the cognitive benefits that only truly relevant information may bring. This, I would argue, is why, for instance, we read light fiction. This is why I could not resist the temptation of writing a post about Paul the octopus even before feeling confident that I had anything of relevance to say about it.

* The “prosecutor’s fallacy” typically results from multiple testing: if a male murderer is known to have had blue eyes, red hair, plucked eyebrows and a limp, the larger the population that you test for this combination of features, the higher the probability that you will find an individual matching it, but the lower the probability that he will be the murderer you are looking for.
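
The two opposite trends in this footnote can be made concrete with a small calculation. The Python sketch below rests on assumptions that are not in the text: the murderer is somewhere in the tested population and certainly matches the description, every innocent person matches independently with the same made-up probability (1 in 10,000 here), and the suspect is picked at random among all those who match. The function names are hypothetical.

```python
def chance_of_innocent_match(population: int, match_rate: float) -> float:
    """Probability that at least one innocent person in the pool matches."""
    return 1.0 - (1.0 - match_rate) ** (population - 1)


def chance_match_is_murderer(population: int, match_rate: float) -> float:
    """Probability that a randomly chosen matching person is the murderer.

    This is the exact value of E[1 / (1 + K)], where
    K ~ Binomial(population - 1, match_rate) counts the innocent people
    who happen to match the description.
    """
    return (1.0 - (1.0 - match_rate) ** population) / (population * match_rate)


if __name__ == "__main__":
    assumed_match_rate = 1e-4  # illustrative figure, not taken from the text
    for population in (1_000, 10_000, 100_000, 1_000_000):
        innocent = chance_of_innocent_match(population, assumed_match_rate)
        guilty = chance_match_is_murderer(population, assumed_match_rate)
        print(f"{population:>9}: innocent match {innocent:.3f}, "
              f"match is murderer {guilty:.3f}")
```

As the tested population grows, the first figure climbs towards 1 while the second falls towards 0, which is the pattern the footnote describes.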


  • Pascal Boyer 15 July 2010 (21:33)

    For excellent dramatic use of the “prosecutor’s fallacy”, watch episode 2, season 3 of “Alfred Hitchcock Presents”, titled “Mail Order Prophet” (1957). In that short film, an accountant in a big corporation receives over several weeks a series of letters that include successful predictions about the outcomes of boxing and football matches. He is naturally led to trust the sender’s prescience, and eventually agrees to ‘borrow’ company money to help the mysterious prophet bet on yet another game. Naturally, the hero cannot suspect that he was only one of the 64, then 32, then 16, then 8, and so on, recipients of these letters. Although I spoiled everything by revealing the trick, it is still great TV. More detail here. Relevance is certainly part of why we would be fascinated by such predictions – and in this case part of the relevance trigger may lie in our statistical intuitions, which tell us that no-one can make more than a (very small) number of successful predictions in a row, so there [i]must[/i] be something else at work. Even though the explanation is pretty simple, it is not really intuitive. In a communication situation, we do not usually consider ourselves as balls in a probability urn. So, an “opaque” and silly but rather natural thought conflicts with a clear and unnatural one. Incidentally, people all over the world are particularly impressed by [i]repeated[/i] misfortune. Losing one cow to disease may be tolerated, but losing it [i]and[/i] having a bad crop [i]and[/i] an accident does look suspicious. It is such situations that lead people to consider that someone is after them. Is the relevance of witchcraft notions based on this same interplay of probability intuitions and the un-naturalness of alternative explanations? When I was a child, I used to believe that only bad things would ever happen to me. People pointed out to me that that was an irrational belief because bad and good things happen randomly.
I tried to tell them that, precisely for that reason, there had to be one person for whom the random selection of good and bad stuff comes out bad on the first round and the second round and the third round and so on… I am glad now to understand that I was being rational (though perhaps depressive).
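
The trick behind the “Mail Order Prophet” letters can be sketched in a few lines of Python. This is only an illustration of the halving described in the comment: it assumes every match has two possible outcomes, labelled A and B here, and that the sender tells half of the remaining recipients each outcome. The function name and labels are invented for the example.

```python
import random


def run_prophet_scheme(n_recipients: int, n_rounds: int, rng: random.Random) -> list[int]:
    """Return how many recipients have seen only correct predictions after each round."""
    believers = list(range(n_recipients))
    counts = [len(believers)]
    for _ in range(n_rounds):
        half = len(believers) // 2
        # The sender mails opposite predictions to the two halves.
        told_a, told_b = believers[:half], believers[half:]
        # The real outcome is outside the sender's control.
        winner = rng.choice("AB")
        # Only those who received the correct letter stay on the mailing list.
        believers = told_a if winner == "A" else told_b
        counts.append(len(believers))
    return counts


if __name__ == "__main__":
    # Prints [64, 32, 16, 8, 4, 2, 1] whatever the match results turn out to be.
    print(run_prophet_scheme(64, 6, random.Random(1957)))
```

Whatever the results, one recipient out of 64 is left having seen six correct predictions in a row, with no way of knowing about the other 63.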

  • Jacob Lee 16 July 2010 (06:23)

    Dan Sperber wrote: [quote]We are plausibly geared to pay attention to events that present a pattern that goes against our expectations. Eight flips of a coin resulting in eight tails grab our attention more than say a tail-tail-head-tail-head-head-head-tail series. Even though both series are equally improbable, only the first one has a noticeable pattern that causes us to pay attention to its improbability. We are geared to assume that regular patterns have a relevant explanation (a relevant explanation is one that explain a lot with relatively few assumptions, a good effort-effect ratio).[/quote]

    I think that this is a very interesting line to pursue. I believe that it can be usefully framed in the language of information theory, however (something that, were I given the chance, I would like to pursue!). In conventional probability theory, eight flips of a fair coin are equally likely to result in any particular sequence of heads and tails. But suppose that we did not know that it was a truly uniform random process. Instead, let us suppose that the sequence of heads and tails is the result of a deterministic and computable process representable as a computer program. After all, we pay attention to patterns in the world because they happen for reasons. As it turns out, if we measure probability in this way, using algorithmic probability theory, these two sequences of heads and tails would [b]not[/b] have the same a priori probabilities.

    In algorithmic information theory, the algorithmic probability of a string (of data) x is related by negative power to the lengths (in bits) of all computable programs p that would output x on a universal Turing machine, by the (alas uncomputable, but approximable) equation $m(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|}$, where U(p) = x means that program p outputs x on the universal machine U and |p| is the length of p in bits. Simple sequences, like all heads, can be generated by very short programs, and therefore have high algorithmic probabilities. Random or patternless sequences can only be generated by programs close to or exceeding the length of the sequence itself; hence they will, individually, have very low probabilities. Nonetheless, the class of sequences that are basically patternless far outnumbers the class of sequences with clearly discernible patterns. The a priori probability of a sequence belonging to the class of patternless random sequences is very great.

    The interesting-ness, and possibly relevance, of an event or set of data about an event must be related in some way to its information content. In Shannon’s information theory, given a prior probability distribution of possible messages or events, the information content of an event is inversely related to its probability. No one would be especially surprised to learn that Brazil or Germany had won the World Cup. We might have guessed it. Learning that Spain had won was information indeed. Learning the one is far more interesting than learning the other. In algorithmic information theory, algorithmically improbable sequences of data (random sequences) have high information content because the only programs that will output these sequences are programs that contain those sequences within them, and so are about as long as the sequences themselves. That a sequence is random in itself is not surprising, since the majority of possible sequences are random. The class of useful data falls in the middle ground between the extremely simple and the extremely complex/random.

    To see this, compare three videos: an hour-long video of a stationary rock, another of a troop of chimpanzees, and another of random white noise on an old television set. Each of these in turn, let us presume, has an increasing algorithmic information content. Suppose also that we used the best lossless compression algorithm we could find to make the video file as small as possible. After a minute of watching the first video we don’t have much left to learn by the next 59 minutes. The rock will still be sitting there at the end of the video. Such a level of complexity does allow us to learn and base predictions upon certain regularities in the environment, but the information content is too low. The video file will be many times smaller than the original uncompressed video: essentially a single compressed frame to be repeated a sufficient number of times.

    What can we learn by watching the third video of white noise? Not much either. One moment of white noise tells us little about the next moment of white noise, just as one cannot learn to predict the proverbial fair coin by observation of prior tosses (one toss does not carry information about another). It all looks the same, even though every “frame” is distinctly different. The most information we can obtain about the video is the relative frequencies of different values on the screen. Yet, the compressed version of this video will be about the same size as the uncompressed video.

    What about the video of the troop of chimpanzees? Its rate of compression will be somewhere between the first and the third videos. Moreover, the information content of the video can be applied to build useful generalizations about the various dynamic regularities of chimpanzee behavior in the video. The information content is relatively high.

    Jon Barwise and Jerry Seligman have argued that information flow is the result of systematic regularities between parts in distributed systems. Information flow crucially depends on the reliability of those regularities. If smoke did not reliably indicate fire then smoke could not carry the information that there is a fire. But learning about fire from smoke makes smoke relevant because fire is relevant. Fire is relevant because of its systematic relationship to other important states of affairs. On the other hand, a sporadic event or unpredictable process is unlikely to carry much information about parts of the world it is not directly related to. One cannot learn one’s fortunes from the stars, and a butterfly’s flapping its wings in California won’t tell you that a leaf will fall from a tree in Paris. One toss of the dice does not carry information about the next. But the throw of the dice will carry information about the fortunes of a gambler. The falling of a granary or the tripping over a root may carry the information that a culprit for this misfortune will be sought. One can, by force of convention, make regularities be.
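
The comparison of the three videos can be imitated on a small scale with an ordinary compressor. The Python sketch below is a rough stand-in, not a measurement of algorithmic information content (which is uncomputable): it uses zlib as the “best lossless compression algorithm we could find”, and three invented byte streams in place of the rock, chimpanzee and white-noise videos. The stream generators, sizes and function names are all assumptions made for the illustration.

```python
import math
import random
import zlib


def surprisal_bits(probability: float) -> float:
    """Shannon information content, in bits, of an event with the given probability."""
    return -math.log2(probability)


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib at its highest level."""
    return len(zlib.compress(data, 9)) / len(data)


def make_streams(n_bytes: int, rng: random.Random) -> dict[str, bytes]:
    """Build three toy byte streams standing in for the three videos."""
    rock = bytes(n_bytes)  # every byte identical, like the stationary rock
    noise = bytes(rng.getrandbits(8) for _ in range(n_bytes))  # no pattern at all
    # "Chimps": a value that usually persists but occasionally jumps,
    # so the past is a useful but imperfect guide to the future.
    state, chimps = 0, bytearray()
    for _ in range(n_bytes):
        if rng.random() < 0.1:
            state = rng.randrange(4)
        chimps.append(state)
    return {"rock": rock, "chimps": bytes(chimps), "noise": noise}


if __name__ == "__main__":
    for name, data in make_streams(100_000, random.Random(0)).items():
        print(f"{name:>6}: compressed to {compression_ratio(data):.3f} of original size")
    print(surprisal_bits(1 / 256))  # a 1-in-256 event carries 8 bits
```

The constant stream shrinks to almost nothing, the noise stream barely shrinks at all, and the structured stream lands in between, in line with the ordering suggested for the three videos.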

  • scott howard 20 July 2010 (12:50)

    There is also an interaction between statistics and our culture or means of learning. Random guesses on a two-way choice will be correct 50% of the time. So we see lots of amazing things happen by chance. If we do not monitor them over a longitudinal study, we often see a string of correct answers and come to believe some strange things, as though we had seen actual proof and held real empirical evidence. Second is our attention to novelty. No one is reporting on all the octopi who got all the answers wrong. Finally, there is our tendency to generalize. 2+3=5 at school and at home, so we assume it also equals 5 at the store. So we see one octopus get correct answers in one situation and generalize that octopi must be very smart.