Somewhere I got a pdf that had protection that would not allow copying and pasting. But I could view it, obviously, so on each page I made the magnification fairly large in Reader or Acrobat and did a screen capture, which I fed to an OCR program. Very inconvenient and useful mostly as a last resort, but it worked.

Steve Gray

Steve Witham wrote:
From: Henry Baker <hbaker1@pipeline.com>

However, since both the ascii text & PDF images are available, someone could easily automate the task of creating such PDF/text files -- even someone outside of Google.
You may be able to OCR the pdf files using the online DjVu converter, assuming there aren't protection bits that some part of the converter gits skeered at: http://any2djvu.djvuzone.org/
Maybe you could then automate the process of correcting the embedded text in the djvu from the ascii from Google.
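As a rough sketch of what that correction step might look like, assuming the OCR words from the djvu and the words of the ascii text have already been extracted into two Python lists (the extraction itself is not shown, and the function name correct_ocr_words is made up for illustration), the standard difflib module can align the two sequences and substitute the ascii words wherever a mismatched run has the same number of words on both sides:

```python
import difflib

def correct_ocr_words(ocr_words, reference_words):
    """Replace misrecognized OCR words with words from a trusted reference text.

    Only runs of equal length are substituted, so the number of words (and
    therefore any per-word position data kept elsewhere) stays unchanged.
    """
    matcher = difflib.SequenceMatcher(a=ocr_words, b=reference_words, autojunk=False)
    corrected = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: trust the reference spelling.
            corrected.extend(reference_words[j1:j2])
        else:
            # Equal runs, or uneven differences we cannot map word for word:
            # keep the OCR words untouched.
            corrected.extend(ocr_words[i1:i2])
    return corrected

if __name__ == "__main__":
    ocr = "Tbe quick brovvn fox".split()
    ref = "The quick brown fox".split()
    print(" ".join(correct_ocr_words(ocr, ref)))
```

Keeping the word count fixed is a deliberate choice here: it means the corrected words could in principle be dropped back into whatever per-word slots the embedded text layer uses. Runs where the OCR split or merged words are left alone and would need a closer look.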
A possible added benefit is that djvu format is much more compact than pdf. On the other hand, the reader for it (at least for the Mac) isn't so nice. Also, the default compression settings don't always seem to do the right thing about distinguishing foreground from background in illustrations.
A simpler trick: maybe the page numbers are in the ascii?
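If they are, something like the following could split the ascii into per-page chunks. This is only a sketch under a big assumption: that each page number sits alone on its own line and comes before that page's text. Whether the Google ascii actually looks like that is not something I know, and the function name split_by_page_numbers is invented for the example.

```python
import re

def split_by_page_numbers(text):
    """Split plain text into pages, keyed by page number.

    Assumes (hypothetically) that a line containing only a 1-4 digit number
    marks the start of that page. Text before the first marker is dropped.
    """
    pages = {}
    current = None
    for line in text.splitlines():
        match = re.fullmatch(r"\s*(\d{1,4})\s*", line)
        if match:
            current = int(match.group(1))
            pages.setdefault(current, [])
        elif current is not None:
            pages[current].append(line)
    return {number: "\n".join(lines) for number, lines in pages.items()}

if __name__ == "__main__":
    sample = "12\nfirst page\nmore text\n13\nsecond page"
    for number, body in split_by_page_numbers(sample).items():
        print(number, repr(body))
```

Each chunk could then be matched against the corresponding page image, which avoids having to align the whole book at once.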
--Steve