The Laboratorium : GBS: Confessions of a Book Pirate

Interesting interview with a compulsive book uploader and downloader. Lots of interesting details, but I found the following most striking:

TM: How long does it take you to scan a physical book?

TRC: The scanning process takes about 1 hour per 100 scans. Mass market paperbacks can be scanned two pages at a time flat on the scanner bed, while large trades and hardcovers usually need to be scanned one page at a time. I’m sure that some of the more hardcore scanners disassemble the book and run it through an automatic feeder or something, but I prefer the manual approach because I’d like to save the book, and don’t want to invest in the tools. Usually I can scan a book while watching a movie or two.

Once scanned, the output needs to be OCR’d – this is a fairly quick process using a tool like ABBYY FineReader.

The final step is the longest and most grueling. I’ve spent anywhere from 5 to 40 hours proofing the OCR output, depending on the size of the book and the quality of type in the original. This can be done in your OCR tool side-by-side with the scan of the original image or separately in your final output type (RTF, DOC, HTML, etc.). If there are few errors on the first few pages of text my preference is to proof in RTF, otherwise I do the proof within Finereader itself.

That’s a lot of time.
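To put rough numbers on that, here is a minimal back-of-the-envelope sketch using only the figures quoted above: about 1 hour per 100 scans, one or two pages per scan, and 5 to 40 hours of proofing. The 400-page book length and the function names are my own assumptions for illustration, and the "fairly quick" OCR step is left out because the interview gives no figure for it.

```python
import math

# From the interview: roughly 1 hour per 100 scans.
SCANS_PER_HOUR = 100


def scan_hours(pages, pages_per_scan):
    """Hours spent at the scanner for a book of the given length."""
    scans = math.ceil(pages / pages_per_scan)
    return scans / SCANS_PER_HOUR


def total_hours(pages, pages_per_scan, proof_hours):
    """Scanning plus proofing; the OCR pass itself is not counted."""
    return scan_hours(pages, pages_per_scan) + proof_hours


# Hypothetical 400-page book (the page count is an assumption, not from the interview).
PAGES = 400

# Mass market paperback: two pages per scan, proofing at either end of the quoted range.
paperback_best = total_hours(PAGES, pages_per_scan=2, proof_hours=5)
paperback_worst = total_hours(PAGES, pages_per_scan=2, proof_hours=40)

# Hardcover: one page per scan.
hardcover_best = total_hours(PAGES, pages_per_scan=1, proof_hours=5)
hardcover_worst = total_hours(PAGES, pages_per_scan=1, proof_hours=40)

print(f"Paperback: {paperback_best} to {paperback_worst} hours")
print(f"Hardcover: {hardcover_best} to {hardcover_worst} hours")
```

Under those assumptions the scanning itself is a few hours at most; nearly all of the total comes from proofing.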

January 26, 2010 at 4:19 PM

Frances Grimble

Confessions of a habitual scanner of genuinely public-domain material:

If you whack pages onto a scanner and don’t worry about the settings, and you run the text through OCR without proofreading, it’s fast. That’s the Google method. (Even so, all those value-added human fingers appearing on the Google scans are highly unusual.)

If you carefully adjust settings for halftones and fine line illustrations (line illustrations require different optimal settings than type, by the way), it’s slow. If you run filters over each photo to eliminate moiré (the ugly plaid effect caused by conflicting patterns of dots in the original published photo and the scan), it’s very slow. If you have to hand edit each photo, which is by no means unusual, it’s very, very slow.

If you proof several hundred pages of OCR (and I’ve used FineReader; it produces plenty of garbage), it’s slow.

With method A you get a lamentable historic preservation. With method B you get a book as good as the original, and in some cases better.

Pirates don’t care if the scan or OCR is of low quality. So let me advance another argument:

The easier it is to copy a document, the more people will do it. It’s harder to scan a paper book. Do people do it? Sure, but they are less motivated than when an e-file is already provided. It’s like people being more motivated to buy a brand of toothpaste because they found a 25-cent coupon in the Sunday newspaper ads. Twenty-five cents doesn’t usually make any difference in their personal economy, but it’s enough incentive for many to choose Colgate over Crest this week.

That’s why I support DRM.

January 26, 2010 at 4:25 PM

Frances Grimble

By the way, if this person can proof any normal-sized book in 5 hours, he or she is much faster than I am, and I’m what the publishing trade considers fast. Unlike editing, the time spent has little relation to the number of errors. The time spent depends on the number of pages. You actually have to read the whole book, and more slowly than if reading for pleasure.

January 26, 2010 at 4:44 PM

john walker

More a mental health problem than a legal problem?