Scan Them All and Let Google Sort Them Out

Towards the end of Jean-Noël Jeanneney’s Google and the Myth of Universal Knowledge, he writes:

In practical terms, what criteria will govern the decision to digitize certain works? With respect to the vast legacy of works now in the public domain … we should favor the great founding texts of our civilization, drawing from each of our countries: encyclopedias; journals of scholarly societies; major writings that have contributed to the rise of democracy, to human rights, and to the recent unification of the Continent; writings that have fostered the development of literary, scientific, legal, and economic knowledge, as well as artistic creation. We should add to these, as I’ve already suggested, works that have appeared in numerous translations, thus attesting to their influence. The same guidelines, probably with less rigid specifications, can be followed for the more recent period. (p. 78)

A little earlier, he explains why this rigorous process of selection is necessary:

A final observation: maybe one of the reasons that the top managers of Google never seriously broach the question of how works are to be digitized is that they maintain the conviction—or rather the illusion—that they can digitize all the books that have ever been printed since the time of Gutenberg. In this fantasy world, there would be no need to worry about selection, and the performance of the digital library would depend only on the quality of the search engine (or engines). But since this perspective is beyond what we can reasonably envision (Is this a bad thing?), we must find the means not only to furnish Internet users with organized knowledge but to indicate its limitations. (p. 73)

This is, in a word, oldthink. Jeanneney repeatedly assumes that comprehensive digitization of our print archives is a pipe dream, from which it follows that the selection processes governing digitization acquire enormous cultural and political importance. I certainly agree with him that the selection is critical, particularly in the shorter run. It is, for example, of great importance that Google’s Book Search scanning project be accompanied by equally ambitious projects for non-Anglophone collections. But that’s as far as this particular observation ought to go.

We are going to have the capability of scanning everything, and we should. Scan it, OCR it, check it, stick it online, and open it up to lots of search tools. (Not just one: he and I agree on this, as well.) The initial Google announcement involved some fifteen million books. The total number of titles printed in the West since Gutenberg is somewhere upward of a hundred million. Google’s proposal is ambitious but clearly realizable; aiming for a hundred million is somewhat more ambitious but not unreasonably so. It will seem more and more plausible with time, as scanning and indexing technology continue to improve.

Things will be chaotic, certainly. There will be duplicated scans, scans of different editions of the same book, scans of translations and pirated foreign editions, scans of books missing pages, and so on and so forth. But these are not dealbreakers. They are exactly the sort of gnarly semistructured data-analysis problems that drove the last few rounds of stunning innovation in Web search. Get the corpus out there and search algorithms will arrive to take advantage of it. Call it a Say’s law of data: put an interesting dataset online and someone will find something interesting to do with it. The point is that massive scanning helps create raw material on which the complexity-increasing dynamism of the Internet feeds. Committees of experts can help us decide what to scan first, but they should not have to decide what to scan at all.

Jeanneney, president of the Bibliothèque nationale de France, clearly loves the potential of digital archiving and appreciates the value of search. But he doesn’t get search. Again and again he complains: “An indeterminate, disorganized, unclassified, uninventoried profusion is of little interest.” (p. 7) “Under these conditions, an undertaking of this kind, attractive as it appears, can hardly be pursued effectively other than within a restricted community capable of ensuring quality under cooperative control.” (p. 51) “The fantasy of exhaustiveness dissipates in the need for choices.” (p. 71) “Hasty classification of a list, following obscure criteria of classification, must be replaced by a whole range of modes, classification modes for responses and presentation modes for results, to allow for many different uses.” (p. 72) “And we must help their teachers by protecting them from disorganized information.” (p. 87)

Exactly, I would say. That’s exactly what good search does. It turns a profusion of scattered information into accessible, organized forms. Jeanneney is right to demand diversity both in the information accessible and in the tools used to access it. But he doesn’t seem to get the idea that the best way to create useful order online is to embrace the chaos. Wikipedia’s lack of a “restricted community” helps it produce more reliable information, not less. Google works because it indexes everything, rather than picking and indexing a subset of high-quality sites. Jeanneney sees a cluttered desk and assumes a disorganized mind.

There is much else to say about this fascinating, baffling, brilliant, confused, maddening, and thoroughly Gallic sliver of a book, but this is the thought that stuck most in my mind as I read it.

I look forward to reading Jeanneney’s book. I just don’t get why people like him don’t see that the true threat to universal digitization is overexpansive copyright.

I love this story from the libertarian von Mises Institute, examining copyright as an aspect of omnipotent government:

“For some years, Misesians have worried about the status of Mises’s wonderful book Omnipotent Government (1944)… . It demonstrates that the Nazi ideology was a species of orthodox socialist theory, … The book has long deserved far more attention than it has received.

However, the current publisher would not allow the text to be put online through the Mises Institute. Many of Mises’s books have been online and, as a result, were being referred to and quoted and discussed (and purchased) as never before. But not Omnipotent Government. It was not getting the attention it deserved, and, indeed, faced the prospect of forever living in the shadows of those books that are online.

After three years of letters, emails, and phone calls, we finally persuaded the publisher to let us go ahead, but we could only do so on the condition that we compensate the publisher in advance for all the lost sales they were sure that they would absorb.

What happened was precisely the reverse of what the publisher expected. Instead of lost sales, the sales of the book shot up. In the few weeks since the text went online, more copies of this book left our warehouse than during the whole of the last decade. Omnipotent Government is now a top seller in the catalog. The publisher obtained not only the leasing fee from our offices but suddenly enjoyed a flood of new orders for the book from us.”