[math-fun] Re: Google books / "text behind image"

4 Nov 2007

      ...
From: Henry Baker <hbaker1@pipeline.com>
I tried searching the downloaded PDF document for Tait's 
Quaternions, but nothing was there.  So it appears that Google isn't 
using "text behind image", at least not for this book.  However, 
since both the ascii text & PDF images are available, someone could 
easily automate the task of creating such PDF/text files -- even 
someone outside of Google.

I downloaded the PDF and submitted it to any2djvu.djvuzone.org, and
got a converted .djvu file, but without text behind, even though I
specified OCR in the entry form.  So, that path isn't working for now.

I don't know whether it's a technical issue, a protection-bits issue,
or a web server problem at djvuzone.  The usual progress log didn't come
out, so I don't have any error messages (I emailed them about it).
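
For anyone who wants to test a PDF for "text behind image" without opening a
viewer, here is a rough sketch using only the Python standard library. It is a
heuristic, not a real PDF parser: it assumes the page content streams are either
uncompressed or Flate-compressed, and it will report "no text" for an encrypted
file even if text is present. The function name has_text_layer is made up for
this example.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# A text object (BT ... ET) containing a show-text operator (Tj or TJ).
# Hidden OCR text is usually drawn with "3 Tr" (invisible rendering mode),
# but it still goes through Tj/TJ, so this catches it as well.
TEXT_RE = re.compile(rb"\bBT\b.*?(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def _mostly_ascii(data: bytes, threshold: float = 0.9) -> bool:
    """Content streams are mostly printable ASCII; image data is not."""
    if not data:
        return False
    printable = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(data) >= threshold


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if some stream looks like it draws text."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line left after the data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate data; inspect the bytes as they are
        if _mostly_ascii(data) and TEXT_RE.search(data):
            return True
    return False
```

To try it on a file, read it in binary mode and pass the bytes, e.g.
has_text_layer(open("book.pdf", "rb").read()), where book.pdf stands for
whatever file you downloaded. A False result on a scanned book is consistent
with an image-only PDF, but given the caveats above it isn't proof.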

The image compression is pretty nice.
    2.9 MB for the djvu vs. 9.7 MB for the pdf.
    The type, equations and line drawings are excellent.
    The faint scribbles left by readers, which are, say, "thresholdy" in
       the pdf, are slightly more thresholdy in the djvu.
    The first page of the publisher's catalog in the back has a couple
       stains.  Text quality over the stains is as good in the djvu as pdf.
    The cover, opening pages and closing pages have a couple places with
       dark sections with writing, which are definitely blurrier in the
       djvu.  Djvu has a concept of "background" which is converted with
       lower resolution; it sometimes makes the wrong decision if you don't
       hand-tweak it.
    The viewer I have (MacDjView) scrolls a lot faster through the
       document than Adobe Reader and Preview do through the pdf.

DjVu ought to be included in the pdf standard as a codec.

  --Steve

Steve Witham
