[kwlug-disc] OCR to web

John Van Ostrand john at netdirect.ca
Fri Jan 7 10:44:51 EST 2011


A year ago I tried two open source OCR packages, one of which was
tesseract. I had a poor experience with both. Out of the box, they
converted very few words correctly. I used them in an urgent situation,
so I didn't have time to understand or experiment with the many switches
and options.

In the end I had a Windows user OCR it for me.


----- Original Message -----
From: kwlug-disc-bounces at kwlug.org <kwlug-disc-bounces at kwlug.org>
To: kwlug-disc at kwlug.org <kwlug-disc at kwlug.org>
Sent: Fri Jan 07 10:24:04 2011
Subject: Re: [kwlug-disc] OCR to web

  A lot of books on Project Gutenberg are done by the Distributed
Proofreaders (http://pgdp.net).

They use a combination of OCR and "crowd sourcing". The book is
first OCRed, depending on the font,
with either standard OCR software or self-trained software (especially
for Fraktur books).

Then the images and OCRed text are uploaded to the website and go page by
page through a 5-step process
of proofreading (3 steps) and formatting (2 steps). After this, a
"post-processor" takes everything and
uses some self-developed tools for text analysis etc. to create the
finished products from the proofread pages,
which are then uploaded to Project Gutenberg.
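
As a rough illustration of that page-by-page flow, here is a minimal
Python sketch. It is not the pgdp site's code: the round names, the Page
class, and the assemble() stand-in for post-processing are all
hypothetical, and only the "3 proofreading steps, then 2 formatting
steps, then post-processing" shape comes from the description above.

```python
from dataclasses import dataclass, field

# Placeholder round names (assumption): three proofreading rounds
# followed by two formatting rounds, as described above.
ROUNDS = ["proof-1", "proof-2", "proof-3", "format-1", "format-2"]


@dataclass
class Page:
    image: str                      # scanned page image file name
    text: str                       # current text, starting as raw OCR output
    done: int = 0                   # number of rounds completed so far
    history: list = field(default_factory=list)

    def next_round(self):
        """Name of the round this page is waiting for, or None if finished."""
        return ROUNDS[self.done] if self.done < len(ROUNDS) else None

    def submit(self, corrected_text):
        """Record one volunteer pass and advance the page to the next round."""
        if self.done >= len(ROUNDS):
            raise ValueError("page has already finished all rounds")
        # Keep the text as it was before this round, for later comparison.
        self.history.append((ROUNDS[self.done], self.text))
        self.text = corrected_text
        self.done += 1


def ready_for_post_processing(pages):
    """A book is ready only when every page has finished all five rounds."""
    return all(p.done == len(ROUNDS) for p in pages)


def assemble(pages):
    """Join pages in order into one text (stand-in for post-processing)."""
    return "\n".join(p.text for p in pages)
```

Keeping the previous text for each round makes it possible to see what
each pass changed on a page.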

The code for the pgdp site is open source and can be downloaded from
sourceforge.net.

I used to be very active with pgdp several years ago, and at the time
most OCR was done with proprietary OCR software
running on Windows. The Linux OCR software was closer to an alpha
state than to being usable. I am not sure if this
is different today. Probably the best open source OCR software was
tesseract. Otherwise, I have used
gocr and ocrad before.
http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
compares
the different OCR systems.

I also have, somewhere, some software written in Perl that I used
for pgdp to make the OCR process
a little more automated.
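
For anyone wanting to automate the OCR step themselves, a minimal Python
sketch of a batch run might look like the following. This is not the Perl
tool mentioned above. It assumes the tesseract command-line program is
installed and is called as "tesseract <image> <output base> -l <lang>",
which writes <output base>.txt. The directory layout, the function names,
and the list of image suffixes are hypothetical.

```python
import subprocess
from pathlib import Path

# Image types to treat as page scans (assumption; adjust as needed).
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg"}


def ocr_command(image, out_dir, lang="eng"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    out_base = Path(out_dir) / Path(image).stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def pending_pages(scan_dir, out_dir):
    """Return page images, in name order, that have no .txt output yet."""
    images = sorted(
        p for p in Path(scan_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [p for p in images
            if not (Path(out_dir) / (p.stem + ".txt")).exists()]


def run_batch(scan_dir, out_dir, lang="eng"):
    """OCR every pending page; stops at the first page tesseract fails on."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image in pending_pages(scan_dir, out_dir):
        subprocess.run(ocr_command(image, out_dir, lang), check=True)
```

Skipping pages that already have a .txt file means an interrupted run on
a 500 page book can simply be started again, e.g.
run_batch("scans", "ocr-text").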

Ralph


On 01/06/2011 05:21 PM, Chris Frey wrote:
> On Thu, Jan 06, 2011 at 04:51:22PM -0500, Insurance Squared Inc. wrote:
>> As some of you know, I scan out of copyright books and publish them on
>> the web.  I've struggled with this process for years.  I'd like your
>> input on the following:
>> - any knowledge of decent linux OCR with gui that will let me OCR say a
>> 500 page book?
>> - let's say I've got the book(s) ocr'ed.  So I've got 100's or thousands
>> of .txt files and the same number of image files.
>>      - how do I get those into a useable web platform i.e. get them into
>> a cms?
>>      - what cms suits this type of application?
>
> Anyone know what Project Gutenberg uses?  They have text and HTML versions
> of their books, and some of the HTML versions look pretty good.
> Might be worth asking them.
>
> - Chris
>
>
> _______________________________________________
> kwlug-disc_kwlug.org mailing list
> kwlug-disc_kwlug.org at kwlug.org
> http://astoria.ccjclearline.com/mailman/listinfo/kwlug-disc_kwlug.org

