Re: [math-fun] Re: Google books / "text behind image"

2 Nov 2007


Somewhere I got a PDF whose protection would not allow copying and
pasting. But I could still view it, obviously, so for each page I set
the magnification fairly high in Reader or Acrobat and did a screen
capture, which I fed to an OCR program. Very inconvenient, and useful
mostly as a last resort, but it worked.

Steve Gray


Steve Witham wrote:
...
...
From: Henry Baker <hbaker1@pipeline.com>
However, since both the ascii text & PDF images are available, 
someone could easily automate the task of creating such PDF/text 
files -- even someone outside of Google.
You may be able to OCR the PDF files using the online DjVu converter,
assuming there aren't protection bits that some part of the converter
gits skeered at:
   http://any2djvu.djvuzone.org/
Maybe you could then automate the process of correcting the embedded
text in the DjVu file against the ASCII text from Google.
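
A rough sketch of that correction step in Python, assuming the OCR text
and Google's ASCII text have each already been split into a list of
words. The function name and the sample words are made up for
illustration, and getting the text out of and back into the DjVu file is
not shown. It only swaps in reference words where a mismatched run has
the same length on both sides, so every word keeps its position:

```python
import difflib

def correct_ocr_words(ocr_words, reference_words):
    """Return a copy of ocr_words with mismatched runs taken from the reference.

    Only 'replace' runs of equal length are substituted, so the word count
    and the position of every word stay the same. Insertions, deletions,
    and unequal-length runs are left untouched.
    """
    matcher = difflib.SequenceMatcher(
        None, ocr_words, reference_words, autojunk=False
    )
    corrected = list(ocr_words)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            corrected[i1:i2] = reference_words[j1:j2]
    return corrected

# Hypothetical sample data: typical OCR confusions against a clean text.
ocr = "Tbe quick brovvn fox jumps".split()
ref = "The quick brown fox jumps".split()
print(" ".join(correct_ocr_words(ocr, ref)))
```

Runs where the OCR split or merged words are skipped on purpose, since
there is no safe one-to-one mapping for them.
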
A possible added benefit is that the DjVu format is much more compact
than PDF.  On the other hand, the reader for it (at least on the Mac)
isn't so nice.  Also, the default compression settings don't always
separate foreground from background correctly in illustrations.
A simpler trick: maybe the page numbers are in the ASCII text?
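
If they are, something like the following Python sketch could cut the
text into pages to line up with the page images. It assumes, without
having checked Google's output, that each page ends with a line holding
nothing but its page number; the function name and the sample text are
made up for illustration:

```python
import re

# Assumption: a page number appears alone on a line, 1 to 4 digits long.
PAGE_NUMBER_LINE = re.compile(r"^\s*(\d{1,4})\s*$")

def split_by_page_numbers(text):
    """Map each page number to the text that precedes its number line.

    Text after the last page-number line is dropped, because there is no
    number to file it under.
    """
    pages = {}
    current = []
    for line in text.splitlines():
        match = PAGE_NUMBER_LINE.match(line)
        if match:
            pages[int(match.group(1))] = "\n".join(current).strip()
            current = []
        else:
            current.append(line)
    return pages

# Hypothetical sample text with page numbers as footers.
sample = "First page text.\n1\nSecond page text,\ncontinued.\n2\n"
for number, body in split_by_page_numbers(sample).items():
    print(number, repr(body))
```

A line of body text that happens to be a bare number would be mistaken
for a page boundary, so the result would need a sanity check (for
instance, that the numbers increase by one).
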
--Steve