How much of a difference does culture make?
In my latest post, I mentioned a very nice study that looked at differences in face-processing between East Asians and Westerners. Though it made a couple of fascinating points, the study also claimed that Asian culture strongly hindered Asians from understanding Western emotions. In fact, its statistically significant result was much too weak to warrant that conclusion. A recent pamphlet looks, among other things, at what makes scientists confound the statistical significance of an effect with its importance. The debate over the significance of significance has precedents in cross-cultural psychology.
I just finished Stephen Ziliak and Deirdre McCloskey's vital pamphlet, The cult of statistical significance: how the standard error is costing us jobs, justice, and lives. These two economic historians did an excellent job of convincing me that everyone in the human sciences (medicine included) should heed their advice and read the book. It's quite a nice read, too – except when the authors let the gossip and the anecdotes run loose, which makes you feel like you're squeezed at a conference buffet between two economics professors talking shop and lambasting some unknown colleague. Anyway, the book's message, old and banal as it may be in certain circles, is a crucial one.
Ziliak and McCloskey
It is about null-hypothesis significance testing. Before you stop reading, please remember that it is the tool you use in order to prove to reviewers that your data are worth publishing. It is the mandatory p < 0.05 threshold without which there is no publishable truth. It has become the de facto gold standard of scientific validity.
This kind of significance testing has been under attack for many years in various fields, including psychology.
Many problems come from the fact that we in the soft sciences are eager to use statistical tests as a badge of scientificity and as a way of getting published; yet most of us (and that includes yours truly) are really bad at handling them. Basic mistakes are routine. The most widespread are: using the word "significant" ambiguously to imply that an effect is important or big; assuming that a significant result rules out the null hypothesis or even proves our own hypothesis; and various minor sins like unwarranted assumptions of normality, outlandish null hypotheses, etc. Most significance tests are run with half-understood, prepackaged software. Yet these problems are not specific to null-hypothesis significance testing (would we be any better at handling Bayesian tools? I very much doubt it).
The great strength of Ziliak and McCloskey's approach is its radicalism. Their target is not so much the many misuses of the test – although they have a lot of clever things to say about those – as the very idea of using such tests. By choosing significance testing over any other measure of error, they claim, scientists have traded size for precision. Precision means being able to recognize an effect against the surrounding noise. Size is how much of a difference an effect makes in the world. Significant effects are precise, but they do not necessarily make a big difference to the state of the world. With a large enough sample, you can get easily recognizable effects that are nevertheless too small to matter. With smaller samples, however, significance testing will tell you to ignore important effects that are just too noisy to get under the p < 0.05 threshold, but nevertheless make a huge (albeit somewhat chaotic) difference to the state of the world.
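The size-versus-precision trade-off is easy to see in a quick simulation. Here is a minimal sketch (the group means, sample size, and seed are made up for illustration, not taken from any of the studies discussed): with a big enough sample, a difference of a twentieth of a standard deviation – negligible by any practical standard – sails under the p < 0.05 bar.

```python
import math
import random

random.seed(42)

def two_sample_p(a, b):
    """Two-sided p-value for a two-sample z-test
    (normal approximation, fine for large samples)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    # p from the standard normal CDF, expressed via erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# A tiny true difference: 0.05 standard deviations between groups.
n = 100_000
a = [random.gauss(0.05, 1.0) for _ in range(n)]
b = [random.gauss(0.00, 1.0) for _ in range(n)]

p = two_sample_p(a, b)
print(p < 0.05)  # True: "significant", thanks to the huge sample
print(sum(a) / n - sum(b) / n)  # ...yet the gap is still ~0.05 sd: negligible
```

Precision, in other words, is bought with sample size; size is a property of the effect itself, and no amount of data makes a trivial difference important.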
This reminded me of a debate that took place in 2001 among cross-cultural psychologists. David Matsumoto and two colleagues re-analyzed several studies published in the Journal of Cross-Cultural Psychology (paper here). For each of the studies they examined (including one from their own research group), they showed that significant differences, most of the time, do not indicate substantive or important differences between individuals. In other words, statistically significant differences were insignificant by any other standard. For example, in a famous paper by Matsumoto and Ekman (1989), American subjects scored higher than Japanese subjects in a disgust-recognition task. But the probability that a given American subject would score higher than a given Japanese subject was only slightly above chance (0.56). There was a great overlap between Japanese and American results. In the recent study on disgust-perception by Asians and Westerners, another negligible-yet-significant difference was exaggerated by the authors – who claimed that the phenomenon had "critical consequences for cross-cultural communication and globalization".
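The 0.56 figure is a probability of superiority: pick one subject at random from each group, and ask how often the first outscores the second. A minimal sketch of how such a measure is computed – using invented, normally distributed scores with a mean gap of about 0.2 standard deviations, not Matsumoto and Ekman's actual data:

```python
import random

random.seed(1)

def prob_superiority(xs, ys):
    """Estimate P(X > Y): the chance that a random member of
    group X outscores a random member of group Y.
    0.5 means pure chance; 1.0 means complete separation."""
    wins = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                wins += 1
            elif x == y:
                wins += 0.5  # count ties as half a win
    return wins / (len(xs) * len(ys))

# Hypothetical scores: a mean gap of ~0.2 sd, roughly the size
# that yields a probability of superiority near 0.56.
american = [random.gauss(0.21, 1.0) for _ in range(1000)]
japanese = [random.gauss(0.00, 1.0) for _ in range(1000)]

print(prob_superiority(american, japanese))  # only slightly above 0.5
```

A gap that large can be highly significant with a thousand subjects per group, yet an individual American beats an individual Japanese subject barely more often than a coin flip – which is exactly the overlap Matsumoto's re-analysis pointed to.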
As usual, the rhetorical twist is implicit and not acknowledged by the authors. As Matsumoto et al. write about the articles they analyzed, "Although most researchers are aware of these limitations of significance testing, most reports rely solely on them, largely ignoring the practical significance of the results." Ziliak and McCloskey came to the same conclusion in their analysis of articles in the American Economic Review. Ziliak and McCloskey would probably concur with Matsumoto et al., who write:
"Interpretations of cultural differences between people based on statistically significant findings may be based on practically insignificant differences between means. Just pause to consider the wealth of knowledge concerning cultural differences in any area of cross-cultural comparison that some may assume to be important or large on the level of individuals because previous research has documented statistically significant differences betweenculturemeans.How many of these are actually reflective of meaningful differences on the level of individuals? Unfortunately, the answer is unknown, unless tests of group differences in means are accompanied by measures of cultural effect size such as those presented here. If theories, research, and practical work that are supposedly applicable to individuals are based on such limited group difference comparisons, theories, research, and applied programs based on these cultural differences may be based on a house of cards."
But there is more to Ziliak and McCloskey's point than a lesson in methodology. Their deeper, and even more worrying, point is that a few decades ago, the quantitative sciences lost sight of the quantitative. We should reclaim it. Few things in life are worth the pain of learning some stats, but this seems like one.