How much of a difference does culture make?

In my latest post, I mentioned a very nice study that looked at differences in face-processing between East Asians and Westerners. Though it made a couple of fascinating points, the study also claimed that Asian culture strongly hinders Asians from understanding Western emotions. In fact, its statistically significant result was much too weak to warrant that conclusion. A recent pamphlet examines, among other things, what makes scientists confuse the statistical significance of an effect with its importance. The debate over the significance of significance has precedents in cross-cultural psychology.

I just finished Stephen Ziliak and Deirdre McCloskey’s vital pamphlet, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives [1]. These two economic historians did an excellent job of convincing me that everyone in the human sciences (medicine included) should heed their advice and read the book. It’s quite a nice read, too – except when the authors let the gossip and the anecdotes run loose, which makes you feel like you’re squeezed at a conference buffet between two economics professors talking shop and lambasting some unknown colleague. Anyway, the book’s message, old and banal as it may be in certain circles, is a crucial one.

Ziliak and McCloskey

It is about null-hypothesis significance testing. Before you stop reading, please remember that it is the tool you use to prove to reviewers that your data are worth publishing. It is the mandatory p < 0.05 threshold above which there is no publishable truth. It has become the de facto gold standard of scientific validity.

This kind of significance testing has been under attack for many years in various fields, including psychology.

Many problems come from the fact that we in the soft sciences are eager to use statistical tests as a badge of scientificity and as a way of getting published; yet most of us (and that includes yours truly) are really bad at handling them. Basic mistakes are routine. The most widespread are: using the word “significant” ambiguously to imply that an effect is important or big; assuming that a significant result rules out the null hypothesis, or even proves our own hypothesis; and various minor sins like unwarranted assumptions of normality, outlandish null hypotheses, etc. Most significance tests are run with half-understood, prepackaged software. Yet these problems are not specific to null-hypothesis significance testing (would we be any better at handling Bayesian tools? I very much doubt it).

The great strength of Ziliak and McCloskey’s approach is its radicalism. Their target is not so much the many misuses of the test – although they have a lot of clever things to say about those. They target the very idea of using such tests. By choosing significance testing over any other measure of error, they claim, scientists have traded size for precision. Precision means being able to recognize an effect against the surrounding noise. Size is how much of a difference an effect makes in the world. Significant effects are precise, but they do not necessarily make a big difference to the state of the world. With a large enough sample, you can easily get recognizable effects, yet these effects will be too small to make a difference. With smaller samples, however, significance testing will tell you to ignore important effects that are just too noisy to get under the p < 0.05 threshold, but nevertheless make a huge (albeit a little chaotic) difference to the state of the world.
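
To see the contrast in action, here is a minimal simulation – purely illustrative numbers, assuming numpy and scipy are available – in which a trivially small effect sails under the p < 0.05 bar because the sample is huge, while a much bigger but noisier effect, measured on a small sample, fails the test:

```python
# Illustrative sketch only: made-up effect sizes and sample sizes.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Tiny effect (0.05 standard deviations) with an enormous sample: "significant".
a = rng.normal(0.00, 1.0, size=20000)
b = rng.normal(0.05, 1.0, size=20000)
print("tiny effect, n = 20000 per group: p = %.4f" % ttest_ind(a, b).pvalue)

# Large effect (0.8 standard deviations) with a small, noisy sample: often not.
c = rng.normal(0.0, 2.0, size=8)
d = rng.normal(1.6, 2.0, size=8)
print("large effect, n = 8 per group: p = %.4f" % ttest_ind(c, d).pvalue)
```

The p-value tracks whether we can detect the effect, not whether it matters – which is precisely the distinction between precision and size.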

This reminded me of a debate that took place in 2001 among cross-cultural psychologists. David Matsumoto and two colleagues re-analyzed several studies published in the Journal of Cross-Cultural Psychology [2]. For each of the studies they examined (including one from their own research group), they showed that significant differences, most of the time, do not indicate substantive or important differences between individuals. In other words, statistically significant differences were insignificant by any other standard. For example, in a famous paper by Matsumoto and Ekman (1989), American subjects scored higher than Japanese subjects in a disgust-recognition task. But the probability that an American subject would score higher than a Japanese subject was only slightly above chance (0.56). There was considerable overlap between the Japanese and American results. In the recent study on disgust-perception by Asians and Westerners, another negligible-yet-significant difference was exaggerated by the authors – who claimed that the phenomenon had “critical consequences for cross-cultural communication and globalization”.
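
The 0.56 figure is what statisticians call the probability of superiority (or common-language effect size): the chance that a randomly picked member of one group outscores a randomly picked member of the other. Here is a rough sketch of how such a number can be computed – with simulated, purely illustrative scores, not Matsumoto and Ekman’s actual data:

```python
# Illustrative sketch of a probability-of-superiority calculation.
import numpy as np

rng = np.random.default_rng(1)

# Two heavily overlapping score distributions, roughly 0.2 standard deviations
# apart (made-up numbers for illustration).
american = rng.normal(0.21, 1.0, size=2000)
japanese = rng.normal(0.00, 1.0, size=2000)

# Probability of superiority: the share of all (American, Japanese) pairs
# in which the American score is the higher one.
prob_superiority = (american[:, None] > japanese[None, :]).mean()
print("P(random American score > random Japanese score) = %.2f" % prob_superiority)
# prints roughly 0.56: barely better than a coin toss, even though groups
# this large differ "significantly" in a t-test
```

A coin toss would give 0.50, so 0.56 is detectable with enough subjects, but it hardly licenses sweeping claims about how Americans and Japanese differ.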

As usual, the rhetorical twist is implicit and not acknowledged by the authors. As Matsumoto et al. write about the articles they analyzed, “Although most researchers are aware of these limitations of significance testing, most reports rely solely on them, largely ignoring the practical significance of the results.” Ziliak and McCloskey came to the same conclusion in their analysis of articles from the American Economic Review. They would probably concur with Matsumoto et al., who write:

“Interpretations of cultural differences between people based on statistically significant findings may be based on practically insignificant differences between means. Just pause to consider the wealth of knowledge concerning cultural differences in any area of cross-cultural comparison that some may assume to be important or large on the level of individuals because previous research has documented statistically significant differences between culture means. How many of these are actually reflective of meaningful differences on the level of individuals? Unfortunately, the answer is unknown, unless tests of group differences in means are accompanied by measures of cultural effect size such as those presented here. If theories, research, and practical work that are supposedly applicable to individuals are based on such limited group difference comparisons, theories, research, and applied programs based on these cultural differences may be based on a house of cards.”

But there is more to Ziliak and McCloskey’s point than a lesson in methodology. Their deeper, and even more worrying, point is that a few decades ago the quantitative sciences lost sight of the quantitative. We should reclaim it. Few things in life are worth the pain of learning some stats, but this seems like one.


[1] Ziliak, S., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.

[2] Matsumoto, D., Grissom, R. J., & Dinnel, D. L. (2001). Do between-culture differences really mean that people are different? A look at some measures of cultural effect size. Journal of Cross-Cultural Psychology, 32(4), 478-490.

2 Comments

  • Simon Barthelme 31 August 2009 (18:00)

    I haven’t read Ziliak and McCloskey’s article but it sounds like the sort of criticism Bayesians have been firing at significance testing for decades. Berger’s classic book on Statistical Decision Theory has a whole list of pathologies, if you don’t mind the equations. Andrew Gelman (who works in political science and stats) keeps pointing out that the null hypotheses that are set up are usually false. For example, if the null is that East Asians and Westerners do not differ on a particular performance measure, it is trivially false for all sorts of uninteresting reasons (maybe one population is a bit more motivated than the other, maybe one has more experience with video games, etc.). So if you want to find “significant” differences between cultures, in the statistical sense, all you need is a large enough sample size. Without a prior theory predicting at least the direction of the effect, for instance that East Asians should do better in a certain task, the intercultural differences we find in experiments may be entirely trivial. Even if we had null hypotheses that made sense, and significance testing was actually what we wanted (i.e., we really wanted to control the false positive rate), it would be rendered useless by the actual practices of experimenters. One such practice is to try different kinds of tests until your data come out significant. Another is to run subjects until your test comes out significant (discussed here: http://www.stat.columbia.edu/~cook/movabletype/archives/2009/08/the_work-until-.html). Finally, on an issue like intercultural differences I suspect a huge file-drawer effect: no one is interested in null results. All in all, the best thing to do when reading an experimental psychology paper is to ignore the p-values and look at the graphs.

  • Pascal Boyer 2 September 2009 (21:19)

    Ziliak and McCloskey’s book is indeed remarkable – it is also, more surprisingly, remarkably funny. In particular, the authors quite cheerfully admit that their cogent, meticulous arguments about the follies of significance testing and the cult of P will almost certainly have no effect whatsoever on current (mal)practice, especially in psychology and the social sciences. Indeed, similar points had been made by Cohen in the 60s and by Gerd Gigerenzer in the 80s, to no avail. These authors, and Ziliak and McCloskey after them, recommended paying attention to effect sizes rather than p values, reporting power, and reporting confidence intervals. Yet the same emphasis on p values is characteristic of most research papers published in our fields. If I may mention a pet peeve of mine, the practice of not just trusting p values, but actually ranking them, is particularly offensive. It is rife in political science, in which one commonly finds results adorned with star-ratings (* for .05, ** for .01, *** for .001), as if these ranks denoted different degrees of support for the hypotheses, which is of course not the case. Another interesting aspect of this phenomenon is that almost everyone in the fields concerned (well, especially in psychology) will recognize that p values are useful but potentially misleading, and that their current use as a standard of confirmation, disregarding power and effect sizes, is insane. Yet the social practice goes on! Another interesting case for the epidemiology of scientific norms. As for the issue of cultural differences – Simon Barthelme and Olivier Morin are of course right. It is almost impossible to test two culturally different samples without finding differences (at p < .05). This is like the good-bad old days of cognitive neuroscience, when all you had to do to get a study published was to plonk people in the scanner and get them to do two different things… with the guaranteed result that activations would be significantly different. The problem for our field is not just that the null hypothesis in many cases is absurd or trivially false; it is also that “culture” is considered an independent variable in most of psychology. So when you find that, e.g., Chinese and US students do not react the same way, you can publish this as evidence that “culture” matters in the domain of behavior you tested. The idea of having to explain why this cultural difference arose or is maintained is not even considered. After all, psychologists study the effects of aging but do not think they should have to explain aging! So it is our job to persuade everyone that “culture” is there to be explained but that in itself it explains nothing. Back to work!