[kwlug-disc] OCR to web

Ralph Janke txwikinger at ubuntu.com
Fri Jan 7 10:24:04 EST 2011


Many of the books on Project Gutenberg are produced by Distributed
Proofreaders (http://pgdp.net).

They use a combination of OCR and "crowd sourcing". The book is first
OCRed, depending on the font, with either standard OCR software or
self-trained OCR software (especially for Fraktur books).

Then the images and the OCRed text are uploaded to the website, where
each page goes through a five-step process of proofreading (3 steps)
and formatting (2 steps). After this a "post-processor" takes
everything and uses self-developed tools for text analysis etc. to turn
the proofread pages into the finished products, which are then uploaded
to Project Gutenberg.
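
To make the page flow concrete, here is a rough sketch in Python. It is
not the pgdp code; the round labels, the Page class and the rule that a
page must pass the rounds strictly in order are my own assumptions for
illustration.

```python
from dataclasses import dataclass, field

# Hypothetical round labels: three proofreading rounds, then two formatting rounds.
ROUNDS = ["proof-1", "proof-2", "proof-3", "format-1", "format-2"]


@dataclass
class Page:
    image: str                                 # scanned page image, e.g. "012.png"
    text: str                                  # current text, initially the raw OCR output
    done: list = field(default_factory=list)   # rounds completed so far

    def next_round(self):
        """Return the round this page is waiting for, or None if it is finished."""
        return ROUNDS[len(self.done)] if len(self.done) < len(ROUNDS) else None

    def submit(self, round_name, new_text):
        """A volunteer saves the page for the given round."""
        if round_name != self.next_round():
            raise ValueError(
                f"page is waiting for {self.next_round()}, not {round_name}")
        self.text = new_text
        self.done.append(round_name)


def ready_for_post_processing(pages):
    """The post-processor can start once every page has passed all rounds."""
    return all(p.next_round() is None for p in pages)
```

Each volunteer only ever sees one page image next to its current text,
which is what lets many people work on the same book at once.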

The code for the pgdp site is open source and can be downloaded from
sourceforge.net.

I used to be very active with pgdp several years ago, and at the time
most OCR was done with proprietary OCR software running on Windows. The
Linux OCR software was still closer to an alpha state than to being
usable. I am not sure if this is different today. Probably the best
open source OCR software was tesseract. Otherwise, I have used gocr and
ocrad before.
http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
compares the different OCR systems.

I also have, somewhere, some software written in Perl that I used for
pgdp to make the OCR process a little more automated.
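
A minimal sketch of that kind of automation, in Python rather than Perl
and not the actual tool: it runs tesseract over every page image in a
directory that has no text file yet. The one-PNG-per-page layout, the
function names and the default language "eng" are assumptions; the
tesseract call uses the usual "tesseract <image> <output base> -l <lang>"
form.

```python
import subprocess
from pathlib import Path


def ocr_command(image, lang="eng"):
    """Build the tesseract command line for one page image.

    tesseract writes its result to <output base>.txt, so passing the image
    path without its extension puts 012.txt next to 012.png.
    """
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def pending_pages(scan_dir, pattern="*.png"):
    """Return the page images, in sorted order, that have no .txt file yet."""
    return [p for p in sorted(Path(scan_dir).glob(pattern))
            if not p.with_suffix(".txt").exists()]


def ocr_book(scan_dir, lang="eng"):
    """OCR every page that is still missing its text file."""
    for image in pending_pages(scan_dir):
        # check=True stops the run if tesseract fails on a page.
        subprocess.run(ocr_command(image, lang), check=True)
```

Because pages that already have a text file are skipped, an interrupted
run over a 500 page book can simply be started again. For a Fraktur book
you would pass whatever language data you have trained or installed as
lang.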

Ralph


On 01/06/2011 05:21 PM, Chris Frey wrote:
> On Thu, Jan 06, 2011 at 04:51:22PM -0500, Insurance Squared Inc. wrote:
>> As some of you know, I scan out of copyright books and publish them on
>> the web.  I've struggled with this process for years.  I'd like your
>> input on the following:
>> - any knowledge of decent linux OCR with gui that will let me OCR say a
>> 500 page book?
>> - let's say I've got the book(s) ocr'ed.  So I've got 100's or thousands
>> of .txt files and the same number of image files.
>>      - how do I get those into a useable web platform i.e. get them into
>> a cms?
>>      - what cms suits this type of application?
>
> Anyone know what Project Gutenberg uses?  They have text and HTML versions
> of their books, and some of the HTML versions look pretty good.
> Might be worth asking them.
>
> - Chris




