GBS: OCR Issues in the GBS Corpus


Danny Sullivan has been playing around with the new Google Ngram Viewer. He’s found that it has some trouble properly recognizing the medial S. When your corpus includes a great many pre-1800 uses of the word “suck,” this produces some unfortunate search results.

Natalie Binder has a series of posts on related issues:


Judging from my experience with OCR (using ABBYY FineReader), OCR that has not been proofed by a human has a lot of problems. OCR wants to interpret everything on the page as a character, or part of one. You can separate text from images, but that still leaves, on or near the text, foxing, mildew, damp stains, dirt, little bits of paper, handwritten notes, and everything else that gets deposited on the pages of books over time. You seldom see really clean pages in an old book. So that leaves a lot to be cleaned up, and an automated spell checker (provided one was even used) is not up to the task.


Oh ‘suck’! it was just a tea leaf


The Ngram viewer might tell you something about publication bias, but beyond that…? My partner is a sort of historian of the sociology of the behavior of academic professional groups. In this area, what does not get considered appropriate for publication can be far more interesting than what does get published.

Can it deal with the changing or confused uses of words like in-un-inflammable, or anti as in before, anti as in against, and anti-art as in post-modern? Actually ‘modern’ as a temporal descriptor, or modern as a quality or historical period: “modernism” could create a few problems.


I noticed some similar problems when I was wondering why people were talking about Hitler in the 1600s. They weren’t.

http://www.librarian.net/stax/3427/google-books-ngrams-on-hegel-and-hitler-and-ocr/


jessamyn

In Cochin (in the latter 1600s) the sultan facilitated the building of a beautiful synagogue for his Jewish subjects, on land near his temple to Krishna (at least I think it was to Krishna). Anyway, you get there by going to Jewtown and Swastika Lane; words can have very different meanings.


Background on how Google “proofs” their O.C.R.’s

Deciphering Old Texts, One Woozy, Curvy Word at a Time
http://www.nytimes.com/2011/03/29/science/29recaptcha.html?_r=1&src=mv&ref=science

