[kwlug-disc] DMS for linux

Steve Izma sizma at golden.net
Sun Oct 31 15:15:21 EDT 2010


On Fri, Oct 29, 2010 at 06:11:24PM -0400, Chris Frey wrote:
> Subject: [kwlug-disc] DMS for linux
> 
> With the recent thread on digitizing books, what do folks use to manage
> the resulting scans?  Is there a document management system for this
> task, or is it mostly custom?
> 
> I assume that the software would be different if you stopped at the OCR
> stage, vs. going all the way to OCR'd web pages.

At work (WLU Press) I have to deal with roughly a couple of
thousand digital "assets" (a commodifying term, if I've ever
heard one). I've looked at various DAMs and used a commercial,
cloud-based one extensively. I think they're all awkward.

I don't see any point in keeping book-length documents in a
database, so the issue is how to find them quickly on your
filesystem. I'm experimenting with Python scripts and strict
naming conventions for files and directories. Since we're a
publisher, we have ISBNs, which make a convenient basis for an
identifying scheme, along with filename suffixes that identify
the format of the book: PDF (various types), print-on-demand,
epub, etc.
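
A minimal sketch of what such a naming convention might look like
in Python. The directory layout, the suffix table, and the
function names are all assumptions made for illustration; the
message does not spell out the actual scheme. The ISBN-13 check
itself is the standard one (alternating weights of 1 and 3, sum
divisible by 10).

```python
from pathlib import Path

# Hypothetical suffixes, one per format; the real convention is not given.
SUFFIXES = {
    "web-pdf": "_web.pdf",
    "print-pdf": "_print.pdf",
    "pod": "_pod.pdf",
    "epub": ".epub",
}


def normalize_isbn(isbn: str) -> str:
    """Strip the hyphens and spaces that ISBNs are often written with."""
    return isbn.replace("-", "").replace(" ", "")


def is_valid_isbn13(isbn: str) -> bool:
    """Check the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    digits = normalize_isbn(isbn)
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def asset_path(root: str, isbn: str, kind: str) -> Path:
    """Build <root>/<isbn>/<isbn><suffix> for one format of one book."""
    if not is_valid_isbn13(isbn):
        raise ValueError(f"not a valid ISBN-13: {isbn!r}")
    digits = normalize_isbn(isbn)
    return Path(root) / digits / f"{digits}{SUFFIXES[kind]}"
```

With one directory per ISBN, every format of a book sits side by
side, and a script never has to search the filesystem: the path
follows from the ISBN and the format name alone.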

It's a good idea to have a database for metadata (title, author,
publication date, etc.) and that can be used to quickly search
for a book based on title words, author, and so on. With a strict
file naming convention, it's easy to construct a pathname once
you've gotten the ISBN out of the database.
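
As a rough illustration of the metadata lookup, here is a sketch
using Python's built-in sqlite3 module. The table name, column
names, and the path layout in path_for() are hypothetical, chosen
only for this example; any database that maps title and author
words to an ISBN would serve.

```python
import sqlite3
from pathlib import Path


def build_catalogue(conn: sqlite3.Connection) -> None:
    """Create a small metadata table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "isbn TEXT PRIMARY KEY, title TEXT, author TEXT, pub_date TEXT)"
    )


def find_isbns(conn: sqlite3.Connection, words: list[str]) -> list[str]:
    """Return ISBNs of books whose title or author contains every word."""
    sql = "SELECT isbn FROM books"
    params: list[str] = []
    if words:
        clauses = " AND ".join("(title LIKE ? OR author LIKE ?)" for _ in words)
        sql += " WHERE " + clauses
        for word in words:
            # One pattern for the title column, one for the author column.
            params += [f"%{word}%", f"%{word}%"]
    sql += " ORDER BY isbn"
    return [row[0] for row in conn.execute(sql, params)]


def path_for(root: str, isbn: str, suffix: str) -> Path:
    """Turn an ISBN from the database into a pathname (assumed layout)."""
    return Path(root) / isbn / f"{isbn}{suffix}"
```

The database holds only the small, searchable metadata; the books
themselves stay on the filesystem, and the ISBN is the one key
that connects the two.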

Having worked for a long time as a typesetter, I'm also
interested in producing readable text from OCR output (we have
this problem a lot at work). If the original is not very
layout-intensive, my method is to use scripts to roughly tag the
text as XML, clean it up with vim, then use groff and friends
to create PostScript and PDF output. The macro language for
groff is a pretty good programming language, so you can write
a lot of conditional statements for automatically handling
typographical issues that in a desktop publishing system (e.g.,
Quark, InDesign, Scribus) require manual intervention.
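
The first step of that workflow, roughly tagging OCR text as XML,
could be sketched in Python along these lines. The <book> and <p>
element names and the rough_tag() function are assumptions for
illustration, not the scripts actually in use; the output is
meant to be a starting point for manual cleanup, not a finished
document.

```python
import re
from xml.sax.saxutils import escape


def rough_tag(ocr_text: str) -> str:
    """Wrap blank-line-separated paragraphs of OCR text in <p> tags."""
    # Rejoin words hyphenated across line breaks: "type-\nsetter" -> "typesetter".
    # This also swallows the hyphen of a genuine compound split at a line
    # end, which is one of the things left for cleanup in an editor.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", ocr_text)

    lines = ["<book>"]
    for para in re.split(r"\n\s*\n", text.strip()):
        body = " ".join(para.split())  # collapse line breaks and extra spaces
        if body:
            # escape() protects &, < and > so the result stays well-formed.
            lines.append(f"  <p>{escape(body)}</p>")
    lines.append("</book>")
    return "\n".join(lines)
```

Keeping the tagging this crude is deliberate: the script only
removes the mechanical drudgery, and typographical decisions are
left to the later editing and formatting stages.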

	-- Steve

-- 
Steve Izma
-
Home: 35 Locust St., Kitchener N2H 1W6 p:519-745-1313 FAX:519-579-9872
Work: Wilfrid Laurier University Press p:519-884-0710 ext. 6125
E-mail: sizma at golden.net or steve at press.wlu.ca

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
<http://en.wikipedia.org/wiki/Posting_style>