GBS: Quantitative Cultural Analysis Demonstration Project


The Google Books team and a group of researchers from Harvard published a paper in Science, “Quantitative Analysis of Culture Using Millions of Digitized Books.” The abstract:

We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of “culturomics”, focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. “Culturomics” extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

Here is an over-the-top article about it from the Guardian; here is a more restrained article from PC Magazine. And here is something rather more interesting: a tool to see trends in word use over time in the Google Books corpus. Just put in a term or terms, and you can see how the frequency with which they’re used changes over the years.
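
At bottom, the tool is plotting a simple statistic: for each year, the number of times a term occurs in the corpus divided by the total number of words printed that year. Here is a minimal Python sketch of that calculation against the publicly released n-gram count files. The file names and the five-column line layout (ngram, year, match count, page count, volume count) are assumptions about the dataset format, not a description of Google’s actual code.

    # A rough sketch, not Google's code: compute the relative frequency of a
    # word per year from the released 1-gram counts. Assumes each data line
    # is tab-separated as: ngram, year, match_count, page_count, volume_count,
    # and that a companion "total counts" file gives per-year totals.

    import csv
    from collections import defaultdict

    def yearly_totals(totals_path):
        """Map year -> total number of words counted for that year."""
        totals = {}
        with open(totals_path, newline="", encoding="utf-8") as f:
            for year, match_count, _pages, _volumes in csv.reader(f, delimiter="\t"):
                totals[int(year)] = int(match_count)
        return totals

    def word_frequency(ngram_path, word, totals):
        """Map year -> occurrences of `word` divided by all words that year."""
        counts = defaultdict(int)
        with open(ngram_path, newline="", encoding="utf-8") as f:
            for ngram, year, match_count, _pages, _volumes in csv.reader(f, delimiter="\t"):
                if ngram == word:
                    counts[int(year)] += int(match_count)
        return {y: c / totals[y] for y, c in counts.items() if y in totals}

    totals = yearly_totals("googlebooks-eng-all-totalcounts.txt")  # assumed file name
    freq = word_frequency("googlebooks-eng-all-1gram.txt", "culturomics", totals)
    for year in sorted(freq):
        print(year, f"{freq[year]:.3e}")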

UPDATE: The New York Times has a well-written story on the paper and tool.

UPDATE: Geoffrey Nunberg has a typically long and thoughtful essay on the paper and tool in the Chronicle of Higher Education.


From “Scholars Elicit a ‘Cultural Genome’ From 5.2 Million Google-Digitized Books” in the Chronicle of Higher Education:

The paper and the public data-mining tool come as Google’s broader book-digitization effort remains in legal limbo. Authors and publishers have besieged that project, calling it copyright infringement, but a legal settlement has yet to be approved.

Asked how Google was protecting the copyright of the books in its new tool, a spokeswoman, Jeannie Hornung, said the publicly available data sets “cannot be reassembled into books.”

Just because the data sets “cannot be reassembled into books” does not make Google’s copy of the book any more legitimate. In producing copies of my entire book the University of Wisconsin and Google infringed my copyrights, regardless of the uses they made of those copies.

Douglas Fevens, Halifax, Nova Scotia— The University of Wisconsin, Google, & Me


“In producing copies of my entire book the University of Wisconsin and Google infringed my copyrights, regardless of the uses they made of those copies.”

No matter how many times you say this (and you must have said it a hundred times so far), it doesn’t make this statement true. Asserting it one more time does nothing to advance anyone’s understanding of the issues in the case.


It’s also a tool for finding bad date metadata in Google’s corpus, but it doesn’t appear to have been used that way.
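
That check would be straightforward to automate against the same released count files. A minimal sketch, assuming the 1-gram file layout described above and a hand-picked table of coinage dates (both illustrative assumptions): a term counted in volumes dated before the term existed points at misdated books.

    # Illustrative only: flag n-gram counts dated before a term could exist.
    # The coinage years below are rough, hand-picked assumptions; hits point
    # at volumes whose publication dates are probably wrong.

    import csv

    COINED = {"internet": 1970, "television": 1900, "phonograph": 1877}

    def flag_anachronisms(ngram_path, coined=COINED):
        """Yield (word, year, count) where the date precedes the coinage year."""
        with open(ngram_path, newline="", encoding="utf-8") as f:
            for ngram, year, match_count, _pages, _volumes in csv.reader(f, delimiter="\t"):
                word = ngram.lower()
                if word in coined and int(year) < coined[word]:
                    yield word, int(year), int(match_count)

    for word, year, count in flag_anachronisms("googlebooks-eng-all-1gram.txt"):
        print(f"{word!r} counted {count} times in books dated {year}: suspect metadata")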


My response to Peter Hirtle’s comment above.


Geoffrey Nunberg’s essay is smart (and laugh-out-loud funny, too). Google has given us a nifty new tool, not a new science.


Did Google actually proof the OCR for this database? Judging from the public-domain ones, they don’t seem to have done it for the scanned books per se.


Frances Grimble: Did Google actually proof the OCR for this database?

I don’t believe that human proofreading is part of Google’s normal workflow for scanned books.