GBS: 129,864,880 Books in the World (At Last Count)

In a recent blog post, Leonid Taycher from Google’s Books team walks through some of the bibliographic challenges associated with counting and cataloguing books. After excluding T-shirts with ISBNs, recognizing that “Lecture Notes in Computer Science, Volume 1234” and “Proceedings of the 4th international symposium on Logical Foundations of Computer Science” are the same book, and performing many other analyses, the team creates a fresh snapshot of its best guess at the extant books of the world twice a week. The most recent count: 129,864,880. That’s a lot, and a lot more than the number of books Google has scanned, but it’s not an unmanageably large number.

UPDATE: Citing familiar concerns from linguists and librarians, and new ones mentioned in the blog post, Ars Technica says that Google’s count is “probably bunk.”

But there remains the problem that it is by no means legal for Google (or anyone else) to scan all books without permission.

Gary Price’s commentary on the Ars Technica post is more interesting than the post itself.

What I think is most significant about Leonid Taycher’s post and calculations is that he uses a different definition of “book” than the one in GBS I and GBS II. This suggests that Google does not feel bound by the legal commitments it made when it signed and submitted the pending GBS I and II to the federal court, and that Google will do as it pleases regarding the publishing and literary industries, based on business expediency, instead of observing the legal provisions and structures set out in the GBS.

Another interesting point is that, judging from the public-domain scans posted on Google Books and the metadata associated with them, Google often scans multiple copies of the same title. Apparently, if a copy comes in from Library A they scan it; if one later comes in from Library B they scan it again; if yet another comes in from Library C they scan it yet again, and so on, with different cataloging data attached to each publicly posted copy. There is all kinds of confusion in their public data regarding multi-volume works, as well as editions. Note that with 19th-century works, reprints of exactly the same material are often called “editions” in the work itself. Sometimes the exact same material was simply retitled for a reprinting for marketing reasons (a practice that has not entirely disappeared).

Given all that, and setting aside the fact that Google is also scanning bound collections of magazines and lumping them in with books, I don’t see that Google is even bothering to count the number of unique books scanned. To be fair, the online library meta-catalog WorldCat also not infrequently has multiple entries for the same title, at least for older works. But WorldCat is not doing anything illegal.

I think Leonid Taycher’s post is merely an assertion that Google can, and will, scan all books, regardless of copyrights, lawsuits, protests by copyright owners, and any and all other actions on the part of everyone who opposes Google’s mighty will. That is extremely troubling. I cannot speak to James’s statement about what Google employees believe, but certainly any belief that pirating copyrighted works is good for society goes hand in hand with such piracy being extremely likely to yield enormous profits for Google.