The dawn of “culturomics”

A team led by Jean-Baptiste Michel and Erez Lieberman Aiden (Harvard University) has just published a paper in Science, "Quantitative Analysis of Culture Using Millions of Digitized Books", that promises to open a new era in the study of cultural evolution.

We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

This research was partly supported by Google's effort to digitize books. Visit their new Ngram viewer!
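For readers who wonder what a tool of this kind plots, here is a minimal sketch of the underlying idea: for each year, the share of all words in that year's texts that match a given word. It is an illustration under stated assumptions, not the authors' pipeline or Google's implementation. The function name `relative_frequency`, the toy corpus, and the whitespace tokenization are all hypothetical choices made for this example.

```python
from collections import Counter


def relative_frequency(corpus, word):
    """Return {year: share of that year's tokens equal to `word`}.

    `corpus` is a hypothetical mapping from a year to a list of text
    strings. Tokenization is a naive lowercase whitespace split, which
    is an assumption of this sketch, not a description of the real tool.
    """
    word = word.lower()
    result = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            counts.update(text.lower().split())
        total = sum(counts.values())
        # Normalize by the year's total so years with more text are comparable.
        result[year] = counts[word] / total if total else 0.0
    return result


# Invented toy data, for illustration only.
toy_corpus = {
    1995: ["welfare reform and welfare policy"],
    2005: ["terrorism and welfare", "terrorism policy"],
}

welfare_trend = relative_frequency(toy_corpus, "welfare")
terrorism_trend = relative_frequency(toy_corpus, "terrorism")
```

Dividing by the yearly total is the important design choice here: raw counts would mostly track how much was printed in a given year rather than how prominent a word was.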





  • Olivier Morin 5 January 2011 (22:02)

    [url=]This post[/url] from Savage Minds pretty much sums up my feelings about this paper…

  • Nicolas Pain 9 January 2011 (15:56)

    Yes Olivier, Ngram is not the perfect tool the authors think it is. But this post is not accurate. The authors say: [quote]By converting the text of the scanned books into a single, massive “n-gram” database – a map of the context and frequency of words in history – scholars could do quantitative research on the tomes without actually reading them[/quote] Matt Thompson writes: [quote]Take that hermeneutics! Now we can interpret texts without reading them[/quote] Aiden and Michel’s purpose is not to suggest that quantitative approaches should take the place of qualitative, close-reading approaches. They mechanize quantitative studies: instead of counting on your fingers the number of occurrences of “being” in each book published in Germany, in German, between 1927 and 1945, a machine will do that for you. Aren’t you happy?

    Matt Thompson writes: [quote]Why are word frequencies significant? The whole thing seems like a glorified version of those inane word count “studies” done by Ok Cupid like this one on gays and straights or this one on whites and non-whites. To understand a word’s meaning (as opposed to its definition) you have to look at context, style, and tone. In short you have to read to interpret.[/quote] Well, of course you cannot expect to find the meaning of a word merely by counting how many times it appears. But is meaning the only thing that counts when you study culture? Of course not. The meaning of a word cannot help us if we want to know whether, by any chance, “terrorism” is more important than “welfare” in the books published in the USA between 1995 and 2008: see the [url=]results here[/url]. Of course, Ngram is not all-powerful: I wish we had more refined data and a way to select the context (academic journals, everyday newspapers, monthly news, web + books + journals, etc.). The more precisely the variables of a study are defined, the more reliable the data, whether we work on a small sample (12% of the books) or on a bigger one.

    Matt Thompson writes: [quote]But there are limitations to how far this analogy can go. There is no part of the genome that is beyond the purview of genomics, but culturomics, with its data set limited to scanned pages of Google Books does not consider “culture” in its entirety.[/quote] Well, when you say that your neighbour is as brave as a lion, you don’t expect him to actually be a lion… Of course the analogy has its limits. The problem lies in what sort of questions you ask. I think much of the reliability depends on the variables. If you want to test a hypothesis, you don’t ask something about culture in general. You are more specific: you control the data set, the conditions of your test, and the types of answer you may get. Thus the problem is not that Ngram “does not consider ‘culture’ in its entirety”, but that it does not provide the means to set more precise conditions of research.

    Matt Thompson writes: [quote]More importantly genetics, of which genomics is but a subfield, is only one of many different perspectives for understanding an organism or population. You are not your genes.[/quote] Who said that quantitative approaches to texts are going to take the place of qualitative approaches? I don’t quite understand this reaction. Mechanization is commonplace in certain areas of research in literature and linguistics. There is only one new thing with Ngram: the quantity of scanned documents. Why is everybody freaking out? Ngram is only the first study with a very, very big number of books! It should be great news!

  • Olivier Morin 9 January 2011 (20:05)

    Hello NP, thanks for the comment! Rest assured, I am not “freaking out”, just nonplussed. True, if I had found this article elsewhere, I would have found it funny, intriguing and ingenious. But considering that this is a Science paper advertising no less than the birth of a new science, I feel that the questions the authors asked of their data are just a little bit trite. Of course, as you point out, the authors are not declaring war on qualitative approaches, or hermeneutics. (You don’t need to declare war on a dying discipline when you publish in Science and the other guys don’t get a tenth of your funding.) But they sure don’t seem very interested in the many disciplines and issues that might have given them better questions to ask. And, given the influence of Science, this is (just a little bit) worrying. But you are right, it’s still an interesting paper about an interesting technique.

  • Nicolas Pain 11 January 2011 (10:31)

    Hi Olivier! Thanks for your answer. My purpose was not to show that Ngram is the best tool ever, but that the post you linked to is not accurate. Frankly, I am not really thrilled by Ngram: as I pointed out, the only news with this tool is the actual and potential amount of scanned data. But we have the right to expect more when we look at the software available in that area of research. Quantity alone is not enough: compared to automata that analyze context, clusters, etc., and to other word-counting software, Ngram has a lot to learn! By the way, your last post on ICCI is really nice.