How much trust should we put in experimental results?

Thinking back on the year 2010, cognitive scientists will probably remember it as the year the Hauser affair broke after years of rumors. That summer, a Harvard investigation committee found Marc Hauser responsible for eight counts of "scientific misconduct". The most serious problem involved a missing control condition in an experiment published in Cognition. Since the experimenters were blind to the conditions they were testing, a weird and important mistake cannot be entirely ruled out. (See here.) The other seven cases involved various forms of sloppiness - whether it was guided by dishonesty is still hard to know. Then informal accusations of misconduct started pouring in, from colleagues and former students. Scientists who spoke against Hauser, like Gordon Gallup or Michael Tomasello, concentrated their attacks on various failures to replicate his results.

This post is not about Marc Hauser. Whether he was guilty of true misconduct or unintentionally produced false results matters less than the fact that we ('we' stands for us, cognitive scientists with an interest in morality, religion or culture) were fooled into accepting this work. "Accepting" is an understatement. We hyped it up, we based arguments upon it, we advertised it to philosophers, anthropologists and the greater public as an example of good science.

The situation would be a little more comfortable if we had not been serving for many years as self-appointed ambassadors of the Scientific Bushido to the social sciences and humanities.

For or against Hauser, the reactions I heard or read in the community were, basically, of two kinds. First, there were those who considered scientific fraud at that level an exceptional (or incredible) evil. Others, however, noted that few researchers could have published 130+ papers over five years while keeping all their data nicely stored and cleaned for the kind of close inspection Harvard inspectors gave Hauser's archives. They pointed out that replication failures happen all the time. And they made vague hints at colleagues (in other labs, other cities, other countries) who got away with much worse.

The righteous and the cynical tended to agree on one point. Everyone expressed concern that the Hauser affair might tarnish the public reputation of cognitive science, and everyone seemed eager to prevent that. This, I think, is exactly the lesson not to draw from the Hauser case.

In a well-functioning culture, the public reputation of our field should be influenced by what happened, just like the public reputation of economists was harmed by recent financial scandals. Changes in reputation following such events are noisy, nasty and unfair. That is how opinion works. Did we complain of noisy, nasty, unfair opinion when it was extolling cognitive science and ignoring other fields?

Guilty or not guilty, Marc Hauser is right about one thing: as Helen De Cruz pointed out, failure to replicate famous experiments is a common problem, especially in primate studies. The scale of the problem is anything but small, and it is not something we can entirely blame on fraud, nor a problem for primatologists alone. Researchers working on replicability suggest that many experimental fields suffer from a tendency to produce results that prove less and less robust with each attempt at reproducing them. In other words, wrong (unreliable and unreplicable) experimental findings are more numerous than most of us would think.

Ioannidis' paper [1] on the topic, published five years ago, is one of the most downloaded papers in the history of PLoS, which makes it all the more striking that it has been so slow to grab headlines. Two recent pieces, one in The New Yorker, the other in The Atlantic, have called the public's attention to the poor replicability of published experiments.

The Atlantic essay covers Ioannidis' two famous papers. In the first, published in PLoS, he tries to make sense of the fact that "80 percent of non-randomized studies (by far the most common type) turn out to be wrong, as do 25 percent of supposedly gold-standard randomized trials, and as much as 10 percent of the platinum-standard large randomized trials." The second shows that "Of the 49 articles, 45 claimed to have uncovered effective interventions. Thirty-four of these claims had been retested, and 14 of these, or 41 percent, had been convincingly shown to be wrong or significantly exaggerated".

The average scientist's reflex is to put the problem down to sloppiness, fraud, an urge to publish or a mix of those. The next step is a call for virtue, followed by a prayer to keep the public's attention away from the mess. The Atlantic essay reflects this attitude: it blames the poor replicability on bias, and blames the bias on ambition.

“The studies were biased,” he says. “Sometimes they were overtly biased. Sometimes it was difficult to see the bias, but it was there.” Researchers headed into their studies wanting certain results—and, lo and behold, they were getting them. We think of the scientific process as being objective, rigorous, and even ruthless in separating out what is true from what we merely wish to be true, but in fact it’s easy to manipulate results, even unintentionally or unconsciously. “At every step in the process, there is room to distort results, a way to make a stronger claim or to select what is going to be concluded,” says Ioannidis. “There is an intellectual conflict of interest that pressures researchers to find whatever it is that is most likely to get them funded.”

This is an attractive perspective, because it is a moral one - and we all experienced a feeling of righteousness when reading Ioannidis (or when reading about Hauser). It is also intuitive - yes, dear reader, you were probably thinking of some particular sloppy, ambition-driven scientists you dislike. Above all, it leaves our belief in the fundamental soundness of the experimental method untouched. Science being pure, individuals need virtue to keep up with its high standards. Individuals failed, their methods did not.

Sadly, there are reasons to think that praying for individual virtue is not likely to solve the problem. Even the most impartial experiments prove abnormally difficult to replicate. Something seems to be broken in our tools, not just in our conduct. Some convincing evidence is cited in Jonah Lehrer's fascinating New Yorker piece. It comes from studies that have been purposefully conducted to test for replication problems. In these studies, the replication trials are built to be as similar as possible to one another, and, quite often, the researchers want the replication to work. Yet the results are astoundingly noisy, far noisier than one or two replications would suggest.

Another problem that individual virtue alone cannot solve is the famous file-drawer effect: the results that make it into journals need to be provocative and important. Yet chance alone produces many sexy and important results that, upon replication, turn out to be weaker, or not to exist at all. A substantial part of the scientific literature might be nothing but the sexy tip of an iceberg of randomness, most of it lying hidden in the sea of unpublished findings.
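
To make this concrete, here is a toy simulation of the file-drawer effect (my own sketch, in Python with NumPy; it is not taken from any study cited here, and the numbers of labs and subjects are arbitrary assumptions). Every hypothetical lab studies an effect that is in fact zero, and only the studies that clear the significance bar get "published". The published effects look impressive anyway, and shrink back toward zero the moment anyone replicates them.

```python
import numpy as np

# Toy model of the file-drawer effect: every "lab" measures a true effect
# of exactly zero, but only studies reaching p < .05 make it into print.
rng = np.random.default_rng(0)

n_labs, n_subjects = 10_000, 20
standard_error = 1 / np.sqrt(n_subjects)

# Each lab's estimate of the (nonexistent) effect.
estimates = rng.normal(0.0, standard_error, n_labs)
z_scores = estimates / standard_error
published = estimates[np.abs(z_scores) > 1.96]  # two-sided p < .05

print(f"published: {len(published)} studies out of {n_labs}")
print(f"mean |effect| among published studies: {np.abs(published).mean():.2f}")
# Roughly 5% of the null studies get "published", with an average apparent
# effect far from the true value of zero - the sexy tip of the iceberg.
```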

The file-drawer effect is especially destructive because it is most often combined with p-value significance testing (see here, here [2] and here [3]). Among other flaws, p-value significance testing (at the .05 level in most fields of cognitive science) guarantees that chance alone will give us a significant result roughly once every twenty trials. In this post, Cosma Shalizi explains how this is sufficient in itself to generate spectacularly counter-intuitive results, and to bolster them against refutation for some time. Of course, these results stand little chance of being replicated - but it might take a while before the community realizes that. What is more, Shalizi notes, the very presence of surprising false results that take some time to get refuted might give scientists the illusion that they are getting somewhere - since the field changes all the time - and bravely applying the scientific method - since we are bold enough to repudiate the fashionable hypotheses of ten years ago. None of this requires dishonesty or sloppiness.
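
The one-in-twenty figure is easy to check for yourself. Below is a minimal sketch (mine, not Shalizi's, assuming Python with NumPy and SciPy; the sample sizes are arbitrary) that runs many two-group experiments in which the null hypothesis is true by construction and counts how often a standard t-test nonetheless declares the difference "significant" at the .05 level.

```python
import numpy as np
from scipy import stats

# Run many experiments in which there is, by construction, no real effect,
# and count how often p < .05 reports a "discovery" anyway.
rng = np.random.default_rng(1)

n_experiments, n_per_group = 2_000, 30
false_positives = 0
for _ in range(n_experiments):
    control = rng.normal(0, 1, n_per_group)
    treatment = rng.normal(0, 1, n_per_group)  # same distribution: no effect
    result = stats.ttest_ind(control, treatment)
    false_positives += int(result.pvalue < 0.05)

print(f"'significant' null results: {false_positives} / {n_experiments}")
# Expect roughly 5%, i.e. about one spurious "effect" per twenty attempts -
# before the file-drawer effect filters out everything else.
```

Combine that base rate with a literature that mostly prints the flukes, and you get the pattern Shalizi describes: surprising results that circulate for a while and then quietly fail to replicate.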

Make a mistake. Acknowledge it today, it is a blunder - or a fraud. Discover it ten years later, and you may call it progress.

I am not saying that experimental science has become a machine that turns randomness into sexy falsities - or that we spend most of our time trying to interpret subtle noise. But if the million-dollar drug trials of the biggest medical journals prove fragile, I would not bet that we, pioneers on the margins of hard science (or at the cutting edge, if you prefer), will do much better. After all, cognitive science has produced more than its fair share of sexy ideas that did not quite live up to the hype they generated (Strong AI... Universal Grammar... Mirror Neurons...). The Hauser affair itself is not utterly without precedent (you might remember the ugly controversies surrounding Daniel Povinelli). Why would the future be different?

Whatever else we do, we are part of a growing trend in the social sciences and humanities: people with a psychological background trying to win the next generation of philosophers / anthropologists / theologians (pick your target) over to the experimental method. We proudly wave the flag of Science. We have (some) money. We are vocal about the shortcomings of traditional methods - and we're lucky enough that our opponents ignore most of the flaws in ours. So we might very well succeed in turning many ethnographers, philosophers or theologians into experimentalists. Popular scientists (like Marc Hauser) have already succeeded in making experimental psychology an authoritative source of insights about ethics, religion or culture.

There is little chance that Marc Hauser, who is still running his lab, will lose tenure. He is still a member of our institute, by the way, and as far as I'm concerned, his name should stay on the members list (not be surreptitiously erased as it was elsewhere). If it eventually turns out that he was forgivably sloppy and biased, we will be glad we did not rush to conclusions. If he committed fraud, then his name on the website will serve as a public reminder of our gullibility.


[1] Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.

[2] Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

[3] Ziliak, S., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.

11 Comments

  • Hugo Mercier 9 January 2011 (23:10)

    Great post, thanks Olivier! Here's a post on Rob Kurzban's blog that points in the same direction (at least as far as social psychology is concerned) http://www.epjournal.net/blog/2010/12/measuring-the-trajectory-of-scientific-theories/

  • Olivier Morin 10 January 2011 (09:32)

    Interesting post by Kurzban. It probably means I should stop criticizing nGram. Also, to quote Casablanca's Captain Renault, I'm shocked, [i]shocked[/i] to find that social psychology is not a cumulative science like physics.

  • Davie Yoon 10 January 2011 (15:10)

    Olivier wrote: [quote]Individuals failed, their methods did not... Even the most impartial experiments prove abnormally difficult to replicate. Something seems to be broken in our tools, not just in our conduct.[/quote] Strong words, which, for me, require strong evidence. What, in your opinion, are the best examples of "the most impartial experiments" proving "abnormally difficult to replicate"? There are a couple of judgment-y aspects of this assertion (most impartial / abnormally difficult) about which I wonder where your opinion lies.

    While I wait for your response, let me give my intuition about what a major problem may be in many failures to replicate: ambiguity as to the degree of similarity between experimental methods (including analysis). In other words, what may be broken in our "tools" -- which to me constitutes a community standard of conduct -- may not be the Scientific Method, but rather [b]our method of describing our methods[/b]. I'm sure many other young scientists like me have had the experience of talking to someone whose research I would like to build on, and discovering an important (or potentially important!) aspect of the experimental design, procedure, or analysis that was not described in the paper. Such experiences kill the beautiful fantasy of the published scientific literature as a comprehensive, error-free repository of Knowledge -- but they can confer some healthy perspective.

    I believe that, based on such experiences, many shrewd experimentalists choose not to make a strong endorsement of a study unless they can in some way "see for themselves" what actually happened in the other person's study. This can range anywhere from visiting a colleague's lab to participate in the experiment yourself, to perusing raw data from individual subjects, to evaluating good records of how the experiment proceeded (e.g., a video recording or a simulation with actual stimuli). Choosing one of these options is especially easy for me since I study human vision and (1) I am a human, and (2) simply viewing the experimental stimulus is often all that's needed, for example, in the case of visual illusions. I would never conduct a replication or an extension of someone else's work based solely on reading the often very sparsely written methods section of a published paper. Moreover, I feel uncomfortable extolling the virtues of another person's empirical work without reading the details of the study, or talking directly with the person who actually conducted it (how do you do quality control?, etc.).

    This probably makes me sound like a nitpicky, highly neurotic and conservative fuddy-duddy. But my goal in writing this comment is not to kill our enthusiasm for blogging or chatting about cool new abstracts that we see on the web and so forth. Instead, I hope that these examples of non-replication and misconduct-related controversies encourage us to take some extra effort when we go beyond being merely interested bystanders to becoming invested advocates.

  • Tom Rees 10 January 2011 (15:50)

    Perhaps there's a selection process going on in science as a whole. I found my PhD frustrating. It was difficult to replicate what others had done, and my own experiments usually gave unclear results. I decided I was simply not a good experimentalist, and dropped out of scientific practice (I write about it instead now). I wonder now whether it was not my technique but my approach that was the problem (probably wishful thinking - I still don't think I'm much of an experimentalist). But it is very easy to explain away results that don't conform to your hypothesis, and so clean up your data that way. Perhaps those who are most effective at that are the ones who stay on?

  • Olivier Morin 10 January 2011 (17:58)

    The examples I had in mind were those cited in Lehrer's piece, plus one personal favorite. Crabbe's experiments with mice show that tightly controlled replications can diverge widely from the original result. So does Schooler's failure to replicate his own effect. My favorite example, though, is Michelson and Morley's attempt at finding traces of aether winds, the failure of which provided Einstein with the first piece of direct evidence for his theory. The original goal was to find a certain difference in speed between the two beams of light of an interferometer. (We now know that the speed of light is the same in every direction, but, of course, they didn't.) According to the Einsteinian legend, no difference was ever found. That is not true. Michelson and his coworkers consistently found speed differences - not as large as they had predicted, but still far too large for Einstein to be proved right. Between 1881 (Michelson's first attempt) and the 1950s, most replications of Michelson's experiment found substantial variations in the speed of light, depending on the direction of the beams. Michelson's student Miller stuck to his results to the bitter end. Of course, he was wrong, and it became obvious with the advent of the laser. But the important point is that, at the time the theory got established, Einstein and the others could not know that Michelson and Miller, the world's leading experts on interferometer experiments, were doing something wrong.

    I think I know what you are thinking, Davie. "All these cases are just failures of methodological skill. Crabbe's assistants were not careful enough. Schooler got better at testing his own effect, eliminated artifacts and, in so doing, reduced his effect size. Likewise, physicists perfected the Michelson-Morley apparatus. So the problem comes from sloppiness, after all." At least for the last two examples, you are probably right. And the solution you offer is commendable: don't be sloppy, beware of artifacts, and never allow yourself to judge an experiment unless you have intimate knowledge of the tools, the setting, the people, etc. This works for a brilliant young scientist like you, who has access to all the labs that count, who is expert at so many things it makes me positively envious, and who works in a field like vision science where many effects can be judged by simply watching the stimuli. But is that a generalizable solution? Take blind reviewers: they are not supposed to know anything apart from what can be learnt in the Methods section. Or take most cases of secret or competitive scientific research. But the biggest problem is probably the fact that replications are so seldom attempted. You feel confident about an experiment when you have seen the data, met the authors, seen the stimuli, etc. But, of course, you did not take the trouble to replicate their work - and most of the time, neither did they. The research I alluded to suggests that there may be some surprises in store for everyone.

    To come at last to the rest of your comment, it raises a classical and important question (best treated, I think, in this book): to what extent are we ready to let replication become skill-dependent? Not all skills are easily transferable, and some of them are quite rare. Nobody knew the workings of an interferometer better than Michelson and Miller. Everybody believed Marc Hauser to be uniquely gifted at studying cotton-top tamarins. So one had to take their word for it.

    A note of hope: the history of science teaches us that the problem of skill-dependence and the problem of replication can be solved. In physics, many experiments are impossible to replicate because of the skills and money involved. Physicists overcame the problem in two ways (it seems to me): first, by making experimental skills transferable and standardized (that's where good "Methods" sections come in handy, as you noted); second, by not letting the experiments speak for themselves. Einstein did not believe Michelson and Miller because, after 1920, their data made no sense at all. Sound arguments and good theory can save us from experimental failures.

  • Nicolas Baumard 10 January 2011 (21:21)

    Davie and Olivier seem to be engaged in a methodological debate about the difficulty of replication [i]per se[/i]. However, it seems to me that the main problem with non-replicability is institutional. As Ioannidis puts it: [quote]“At every step in the process, there is room to distort results, a way to make a stronger claim or to select what is going to be concluded,” says Ioannidis. “There is an intellectual conflict of interest that pressures researchers to find whatever it is that is most likely to get them funded.”[/quote] The Atlantic piece gives a hint about how this can happen: [quote]Imagine, though, that five different research teams test an interesting theory that’s making the rounds, and four of the groups correctly prove the idea false, while the one less cautious group incorrectly “proves” it true through some combination of error, fluke, and clever selection of data. Guess whose findings your doctor ends up reading about in the journal, and you end up hearing about on the evening news? Researchers can sometimes win attention by refuting a prominent finding, which can help to at least raise doubts about results, but in general it is far more rewarding to add a new insight or exciting-sounding twist to existing research than to retest its basic premises—after all, simply re-proving someone else’s results is unlikely to get you published, and attempting to undermine the work of respected colleagues can have ugly professional repercussions.[/quote]

    Thus, it seems to me that the main problem here is the pressure to publish. On the one hand, publications and impact factors allow institutions to evaluate researchers and to direct money toward the most efficient ones. On the other hand, the by-products are very costly: we do not know how trustworthy all these publications are, because everyone (the scientists, their institutions, the journals) has an interest in publishing as much as possible. To me, this is the main problem of today's science: the extrinsic pressures to publish (get a position, get some money, etc.) are so strong that you cannot trust scientists' intrinsic motivation to seek the truth and be honest. I have no specific solution to propose. But I agree with Olivier: one thing the citizens (who fund us) should know is that peer review is not synonymous with truth. [quote]In a well-functioning culture, the public reputation of our field should be influenced by what happened, just like the public reputation of economists was harmed by recent financial scandals. Changes in reputation following such events are noisy, nasty and unfair. That is how opinion works. Did we complain of noisy, nasty, unfair opinion when it was extolling cognitive science and ignoring other fields?[/quote] Well said, Olivier!

  • Jacob Lee 12 January 2011 (09:25)

    I wonder if the establishment of (more?) journals exclusively devoted to negative results might ameliorate the problematic consequences of the sorts of institutional pressures pointed out by Nicolas Baumard.

  • Nicolas Baumard 13 January 2011 (11:43)

    Well, I guess the problem will still hold: a journal of negative results will have a smaller impact factor than the standard journals. It seems to me that the solution would be to put more emphasis on the intrinsic motivation of scientists, to fund them for longer periods, and to stop evaluating them mainly with bibliometric indicators. After all, there was a time when bibliometrics did not exist and science still worked!

  • Jerome Pesenti 13 January 2011 (16:29)

    Olivier, I agree that the problem is methodological, but I think that the Hauser affair hides deeper problems. As the recent publication on ESP shows, you don't need any kind of misconduct to produce bogus results. You just need to use statistical tools in an accepted way. People like Paul Meehl or Jacob Cohen in psychology, David Freedman in sociology or Deirdre McCloskey in economics showed decades ago that the (statistical) tools used are completely inappropriate. The problem is, if we were to implement their recommendations, it would become 10x more difficult to publish and many people would lose their jobs - the same people who are in charge of reviewing and accepting publications. There is an inherent conflict of interest, which means that the fields themselves won't implement stricter practices. Only through big scandals affecting their external reputation will they be forced to do so.

  • Olivier Morin 13 January 2011 (17:51)

    To Jerome and Nicolas: you are absolutely right, this is not a matter of methods, narrowly construed. Methods matter because they have become institutionalized. You could find a comparison in economics: mathematical tools like portfolio management techniques were flawed in themselves, but the real problem was, of course, that these methods were endorsed by important credit rating agencies, investment funds, etc. The methods were institutionalized precisely because they could be used to invent incredibly alluring investments. Just like investors stuck with bogus financial products, people who trust science now have to deal with a potentially vast number of bogus findings.

    Now, let me sound a more optimistic note. I don't think that changing methods and institutions need be quite as painful as you two seem to think. Jerome, most of the people you cite did not advocate a change in methods that would lead to fewer results being published. On the contrary, they noticed that significance testing will very often produce false negatives. Likewise, the problem with the file-drawer effect lies in the fact that not enough results are published. I don't think anyone advocates restricting publication for statistical reasons: what we need is not less information. I agree with Jacob Lee: creating journals that accept "uninteresting" results (PLoS might be an example) looks like a good idea to me. Yes, Nicolas, publications in these journals will not be rated as highly as others. But that is part of the point: the value of publications in general should be rated down, and the value of robustness and replicability rated up. The new journals could help do just that.

    Will people lose their jobs? There will probably be a reallocation of resources between disciplines. The psychology of cotton-top tamarins, for a start, is likely to suffer. In my opinion, so should the study of linkages between mental diseases and genes. But many fields are not as dependent on the Big Mistakes (think of History, Philosophy, Physics, Logic, Law...). I am not worried about science in general. Another piece of good news: there will be plenty of jobs for statisticians selling tin-canned Bayesian packages that no one will understand, to replace the tin-canned Fisherian packages that nobody knew how to use. It's a wonderful life.

  • Jerome Pesenti 13 January 2011 (19:13)

    Olivier, I don't worry about science either, but I do believe the changes need to be painful. The amount of [i]original[/i] published research needs to diminish drastically (5x, 10x, 100x?). With proper use of statistical tools (which IMO means simpler tools and reasoning, not more advanced ones), it's much harder to find new results (as Freedman puts it, it requires much more "shoe leather" and ground work than statistical tinkering). Yes, some of that could be compensated for by doing more replication and publishing more negative results, but that won't be as exciting. I remember discussing this exact topic with your PhD supervisor 15 years ago, during my master's at the EHESS. I complained about some research in his department making grand claims about theory of mind based on a "significant" 0.1% difference in response time to simple questions. Dan's take at the time was that if it was significant, it was worth publishing. I hope he has changed his mind!