[kwlug-disc] digitizing books

Thu Oct 28 16:58:05 EDT 2010

As some of you know, I've got tens of thousands of pages digitized, and 
most of them are already online.  It is, quite frankly, an absolutely 
horrible business.  (For one of my dead, non-commercial projects, see 
fishing buddies dot ca; I've got one book online there.)

Scanning is a pita.  The problem with a regular flatbed is that you have 
to bend the book, and the pages still don't scan well.  I use an 
opticbookpro book scanner, but it's manual and one page at a time, which 
is extremely time consuming and beats up the books as well.  USB scanning 
technology means 9 seconds per page.

Because I post my books online, all my stuff has to be old enough to be 
out of copyright, so my books are delicate.  Handling becomes a 
consideration, both for the time it takes and for the condition of the 
books.

Commercial-quality book scanners cost $10K and up.  I actually found a 
used one once; it was like Christmas that day.  Drove to Barrie to buy 
it.  Turned out to be a paperweight.  I guess I got coal in my stocking 
that year.

The other site similar to the one Richard pointed out is 
diybookscanners.com.  They have some home brews there along with 
directions.  The basic trick is to hold the book in a V shape so you can 
image two pages at once without bending it flat, and to use cameras 
instead of scanner technology.  Cameras let you take an image in a second 
or less, instead of the 9-10 seconds you'd see on a scanner.
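
To put rough numbers on that difference, here is a back-of-the-envelope 
sketch in Python.  It uses the figures above (9 seconds per page on the 
scanner, about 1 second per two-page camera shot) and a 500-page book. 
The function name and the handling_seconds parameter are made up for 
illustration, and page-turning time is left at zero, so real sessions 
will take longer.

```python
def scan_hours(pages, seconds_per_capture, pages_per_capture=1,
               handling_seconds=0.0):
    """Estimate hours of pure capture time for one book."""
    # Ceiling division: an odd last page still needs its own capture.
    captures = -(-pages // pages_per_capture)
    return captures * (seconds_per_capture + handling_seconds) / 3600


# One page per 9-second scan on a USB scanner.
flatbed = scan_hours(500, 9)
# Two facing pages per 1-second shot on a V-shaped camera rig.
camera = scan_hours(500, 1, pages_per_capture=2)

print(f"scanner: {flatbed:.2f} h, cameras: {camera:.2f} h")
```

Setting handling_seconds to however long it takes you to turn a page 
gives a more honest comparison, since turning pages is manual either way.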

So now you've got your 500-page book scanned.  The next issue for some 
of us is OCRing it.  That's a whole other ballgame, and again a very 
manual one.  There's pretty much nothing on Linux that does a good job 
of it.  As soon as you run into nonstandard page layouts, pictures, or 
tables, things go bad fast with OCR software.

Spend the time OCRing it and the next problem is indexing and 
organizing.  The physical book is already organized into chapters, with 
a table of contents and index.  After you image and OCR it, you've lost 
that metadata.  I'll leave it to your imagination to figure out how to 
automate the replacement of a table of contents :) (hint: it's not 
really something you can automate).
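
The most you can realistically script is a list of candidates to check 
by hand.  A minimal sketch in Python, assuming the OCR output is one 
text string per page and that chapter headings look like "CHAPTER" plus 
a number on their own line (both are assumptions; guess_toc and the 
sample text are invented for illustration).  Real OCR output will 
misspell headings and break this regularly.

```python
import re

# Assumed heading shape: "CHAPTER", an Arabic or Roman numeral, then an
# optional title on the same line.
HEADING = re.compile(
    r"^\s*CHAPTER\s+([0-9]+|[IVXLC]+)\b[.:\s]*(.*)$", re.IGNORECASE
)


def guess_toc(pages):
    """Return (page_number, label, title) for lines that look like headings.

    pages: list of strings, one per OCR'd page, in book order.
    """
    toc = []
    for number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            match = HEADING.match(line)
            if match:
                label = match.group(1).upper()
                toc.append((number, label, match.group(2).strip()))
    return toc


sample_pages = [
    "CHAPTER I. The River\nIt was early in the season...",
    "and the trout were rising.",
    "Chapter 2: Tackle\nA good rod is...",
]
candidates = guess_toc(sample_pages)
```

Every entry still needs a human to confirm it, which is where the time 
goes.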

If you want to go to the next step and put the book online, now you've 
got to find a way to import all these pages of poorly OCR'ed text :) 
into a web page and, again, keep the metadata, like the indexes and 
table of contents.  I've had to have custom software written to handle 
that task.
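
To give a feel for what that kind of import involves (this is not the 
custom software mentioned above, just a minimal sketch), the Python 
below turns a list of page texts and a hand-checked table of contents 
into a single HTML page with links.  The function name, the "page-N" 
anchor scheme, and the sample data are all assumptions made for 
illustration.

```python
from html import escape


def build_book_html(title, pages, toc):
    """Render OCR'd pages as one HTML document with a linked contents list.

    pages: list of page texts in book order.
    toc:   list of (page_number, chapter_title) pairs, 1-based page numbers.
    """
    parts = [
        f"<html><head><title>{escape(title)}</title></head><body>",
        f"<h1>{escape(title)}</h1>",
        "<ul>",
    ]
    # Each contents entry links to the anchor of the page it starts on.
    for page_number, chapter in toc:
        parts.append(
            f'<li><a href="#page-{page_number}">{escape(chapter)}</a></li>'
        )
    parts.append("</ul>")
    # Escape the OCR text so stray <, > and & cannot break the markup.
    for page_number, text in enumerate(pages, start=1):
        parts.append(
            f'<div id="page-{page_number}"><pre>{escape(text)}</pre></div>'
        )
    parts.append("</body></html>")
    return "\n".join(parts)


html_doc = build_book_html(
    "Sample Book",
    ["First page text", "Second page <with> odd OCR & symbols"],
    [(1, "Chapter 1"), (2, "Chapter 2")],
)
```

Keeping the page number as the anchor means the original book's index, 
which refers to page numbers, can be linked the same way.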

In the end, if you're looking to just image books and have a file of 
images you can flip through on your computer, that machine will work 
reasonably well.  If you want to take it to the next step things get 
very manual very fast.

Nevertheless, projects like the one Richard pointed out are moving us 
forward fast. I'm excited, and hope one day to be able to digitize 
hundreds more books that are sitting in boxes in my house.

On 28/10/10 04:10 PM, Richard Weait wrote:
> I know that this topic and hardware will find a small and enthusiastic
> crowd here as well.
>
> If you had to scan a book, what would you use?
>
> http://bookliberator.com/doku.php