[math-fun] Re: Google books / "text behind image"
From: Henry Baker <hbaker1@pipeline.com>
However, since both the ascii text & PDF images are available, someone could easily automate the task of creating such PDF/text files -- even someone outside of Google.
You may be able to OCR the pdf files using the online DjVu converter, assuming there aren't protection bits that some part of the converter gits skeered at: http://any2djvu.djvuzone.org/
Maybe you could then automate the process of correcting the embedded text in the djvu from the ascii from Google.
A possible added benefit is that djvu format is much more compact than pdf. On the other hand, the reader for it (at least for the Mac) isn't so nice. Also, the default compression settings don't always seem to do the right thing about distinguishing foreground from background in illustrations.
A simpler trick: maybe the page numbers are in the ascii?
--Steve
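One way the "correct the embedded text from the ascii" step might be automated is sketched below in Python. It is only an illustration under stated assumptions: the OCR layer and the Google ascii are taken to be already extracted into word lists (that extraction is not shown), and the function name `correct_ocr_words` is made up for this example. The idea is to align the two word sequences with the standard library's `difflib` and swap in the reference word only where the alignment is one-for-one, so that each word keeps its position in the page's text layer.

```python
import difflib


def correct_ocr_words(ocr_words, reference_words):
    """Return a copy of ocr_words with one-for-one mismatches replaced
    by the aligned words from reference_words.

    Hypothetical helper: assumes both inputs are lists of word strings.
    """
    matcher = difflib.SequenceMatcher(
        a=ocr_words, b=reference_words, autojunk=False
    )
    corrected = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: trust the reference text
            # and keep the word count, so word positions stay meaningful.
            corrected.extend(reference_words[j1:j2])
        else:
            # Equal runs, insertions, deletions, and uneven replacements:
            # keep the OCR words untouched rather than guess.
            corrected.extend(ocr_words[i1:i2])
    return corrected


ocr = "Tbe quick brovvn fox jumps".split()
ref = "The quick brown fox jumps over".split()
print(correct_ocr_words(ocr, ref))
# ['The', 'quick', 'brown', 'fox', 'jumps']
```

Leaving uneven runs alone is a deliberately cautious choice: when the OCR has split or merged words, there is no obvious way to assign the reference words to the existing word positions.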
Somewhere I got a pdf that had protection that would not allow copying and pasting. But I could view it, obviously, so on each page I made the magnification fairly large in Reader or Acrobat and did a screen capture, which I fed to an OCR program. Very inconvenient and useful mostly as a last resort, but it worked.
Steve Gray
If you use a Mac, then the PDFLAB program will set the access control information for a pdf file. Tell it you want to encrypt it, and it will ask for new permissions and a password. It sets them without checking for the existing encryption key.
There was a period, somewhere around 1996, when journals set these annoying flags. They seem to have learned not to, but there are files with them still set in the archives. Very annoying. Nature is one of the offenders.
(In reply to Steve Gray's message of Nov 3, 2007, 12:10 AM.)
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
Note that the "Copy & paste forbidden" property is just one bit that the reader can ignore. You can easily "fix" the problem in e.g. xpdf. But one should refrain from distributing the modified software (or even documents!). Nor will I describe how this is done (you'll have to change just one line in an obvious way).
I often use this to get an ascii image of pseudocode given in papers. Btw., these codes always have severe flaws.
The "features" that use actual encryption may be harder to circumvent; I never bothered to try.
(In reply to Tom Knight <tk@csail.mit.edu>, Nov 5, 2007, 09:21.)
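To make the "just one bit" remark concrete, here is a small Python sketch that decodes a PDF permissions value into named flags. It only reads the number; it does not open or alter any file. The bit positions are my recollection of the standard security handler's permission table in the PDF Reference (bit 3 print, bit 4 modify, bit 5 copy/extract, bit 6 annotate, counting from 1) and should be checked against the specification; the function name `decode_permissions` and the sample values are made up for illustration.

```python
# Bit positions (1-based) as recalled from the PDF Reference's table of
# user access permissions; an assumption to verify against the spec.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy_or_extract": 5,
    "annotate": 6,
}


def decode_permissions(p_value):
    """Map a signed 32-bit permissions integer to {name: allowed}.

    Hypothetical helper for illustration only.
    """
    flags = p_value & 0xFFFFFFFF  # view the signed value as 32 raw bits
    return {
        name: bool(flags & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


print(decode_permissions(-60))
# {'print': True, 'modify': False, 'copy_or_extract': False, 'annotate': False}
print(decode_permissions(-44))
# {'print': True, 'modify': False, 'copy_or_extract': True, 'annotate': False}
```

The two sample values differ only in the value-16 bit, which is the copy/extract flag under the assumed table.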
participants (4)
- Joerg Arndt
- Steve Gray
- Steve Witham
- Tom Knight