GBS: What Do We Know About the Books Out There?


Beyond 1923: Characteristics of Potentially In-copyright Print Books in Library Collections, by Brian Lavoie and Lorcan Dempsey (D-Lib Magazine, November/December 2009), tries to give some tentative answers about the shape of the elephant. From the introduction:

The analysis that follows examines the characteristics of US-published print books, with an emphasis on books that are likely in copyright according to US copyright law. As with our earlier article, the analysis is based on data from the WorldCat database, which represents the aggregated collections of more than 70,000 libraries worldwide. The analysis focuses on three areas: the WorldCat aggregate collection of US-published print books; the subset of this collection published during or after 1923 - i.e., those potentially associated with copyright and/or orphan works issues; and the combined print book collection of three academic research library participants in Google Books - again, with an emphasis on materials that are potentially in copyright.

Lots of detailed tables on dates, authors, genres, and audience level follow. From the conclusion:

This article characterizes the aggregate collection of US-published print books in WorldCat, with a special emphasis on materials published during or after 1923, and therefore either potentially or definitely in copyright. Findings from the analysis indicate that the collection of US-published print books in WorldCat is quite large, encompassing about 15.5 million print books. Nearly two-thirds of these - those published after 1963 - have a high likelihood of being in copyright; less than 15 percent - those published prior to 1923 - are almost certainly in the public domain, with the rest - those published between 1923 and 1963 - potentially in copyright if copyright was renewed. The post-1923 materials collectively account for more than 80 percent, or about 12.6 million, of the US-published print books in WorldCat. It is difficult to predict how many of these print books might be orphan works, but even a small fraction would, in terms of absolute numbers, be considerable, and require a substantial effort to investigate and clear copyright. One study, based on an examination of a random sample of books, estimates a cost of approximately $200 for each title for which digitization and access permissions were obtained.

(Via ResourceShelf.)
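To give that last point some rough scale (my arithmetic, not the article's): if even 5 percent of the roughly 12.6 million post-1923 titles turned out to be orphan works, that would be on the order of 630,000 books, and at the cited figure of about $200 per title, investigating and clearing them would cost something like $126 million.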


The definition of a book in GBS 1.0 & 2.0 - “a written or printed work … published or distributed … or made available for public access as a set of written or printed sheets of paper” - goes far wider than conventionally published books. It includes dissertations, reports, monographs and - apart from the items specifically excluded - just about anything on paper that has been deposited in a library.

Also, while WorldCat may be able to identify the country of publication, the Settlement “find & claim” algorithm cannot. There are 11 “books” under my name in the GBS database (a thesis, a monograph, a pamphlet, a festival program & 7 books published by New Zealand publishers). 4 of the books are commercially available; 3 of the 4 have been “digitized without authorization”. None of the books appear in the search results when I enter my name as author & type “New Zealand”, “NZ” or “N.Z.” into the imprint box of the search engine. All I get back is the pamphlet I wrote for the Heart Foundation of New Zealand.

Hint: if you haven’t already got an account on the GBS claim site, get one. You can use it to search the database not only for your own books but for all the “books” listed under the names of all the authors and publishers you wish to enter. Most of the “books” “digitized without authorization” and recognised by GBS as published in NZ are monographs authored and published by organisations (New Zealand Romney Marsh Sheep Breeders Association, New Zealand Oceanographic Institute, etc.). I suspect the authors of these works have no idea that the GBS concerns them.


Lynley Hood said: “Hint: if you haven’t already got an account on the GBS claim site, get one.” Sorry Lynley, but I refuse to go to a Google site. To me, doing so would legitimize what I feel is Google & Company’s illegal activity regarding the digitization of my book. I do not want to send the message that I condone their digitization of my work. I have even deleted Google from my web browser’s toolbar because I saw no reason why I should have to stare at their brand the whole time I’m on the net. Douglas Fevens, Halifax, Nova Scotia - The University of Wisconsin, Google, & Me


It’s important to clarify what the numbers in the Dempsey/Lavoie article represent. Each “book” that is counted represents a published product at about the same level of granularity that today would be given an ISBN. Therefore, if a publisher re-issues a book from their backlist after the previous print run has been exhausted (say, a decade later) and with a new introduction, it is considered a different book. The publication date that is fed into the study is the date of the new issuing of the book. Also, as publishers re-package and re-print public domain books, these also are considered separate products with new ISBNs and new dates. Thus, if you look up a commonly re-published book like “Moby Dick, Or The Whale” in the Library of Congress catalog, you retrieve 40 items (and more if you use the short form of the name, simply “Moby Dick”), of which only one is pre-1923 — that one was published in 1851. Of the other 39 instances of the publication of the work, which range from 1925 to 2006, some contain what GBS calls “inserts” - that is, separately copyrightable intellectual property in the form of introductions, etc. - but others may be straight republications of the text.
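To put the granularity point in miniature (a toy sketch: only the 1851 date is taken from the Library of Congress example above, the other years are invented for illustration), counting at the manifestation level makes nearly every record for the work look post-1923, even though the primary text itself is long since in the public domain:

    # Toy sketch of manifestation-level counting; only 1851 is a real date
    # from the example above, the rest are invented for illustration.
    manifestation_years = [1851, 1925, 1948, 1977, 2006]

    post_1923 = [y for y in manifestation_years if y >= 1923]
    print(len(manifestation_years), "manifestations,", len(post_1923), "dated 1923 or later")
    # The underlying work was first published in 1851, so its primary text
    # is public domain no matter how many later packagings carry recent dates.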

What Google lacks the ability to do (yet?) is to make the proper connection between the original text that is in the public domain and the many “manifestations” (as they are called in library-speak) that were published later — and are also in the public domain, at least as far as the primary text is concerned. This is a non-trivial exercise when one is working only with the metadata that describes the work, but may become more feasible with the ability to do a full text analysis of the contents of the various packages in which publishers have placed the original work of Melville. I assume that Google is working on this, although I cannot predict how it will affect their assessment of the PD/(c) split.
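For the flavor of what such a full-text comparison might involve - and this is a generic sketch of one standard technique, not a claim about how Google actually does it - one can compare the overlapping word sequences of two scanned texts and treat a high overlap as evidence that they are manifestations of the same underlying work:

    # Generic sketch: match texts by word shingles and Jaccard similarity.
    # Illustrative only; not Google's actual method.
    def shingles(text, n=5):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def likely_same_work(text_a, text_b, threshold=0.8):
        # A straight republication of Melville's text should score near 1.0;
        # added introductions and notes pull the score down somewhat, but the
        # shared primary text keeps it well above that of unrelated books.
        return jaccard(shingles(text_a), shingles(text_b)) >= threshold

In practice any such matching would also have to cope with OCR errors and with separating the primary text from the “inserts”, which is part of what makes the exercise non-trivial.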