diglib Archive
Date: Tue Apr 19 09:28:35 2005

diglib: file naming



For those of you who do not read this other
list, I thought this was a useful exchange
on a topic that we have also touched on.

Carol


Date: Sun, 17 Apr 2005 06:57:21 -0700 (PDT)
From: Jack Kessler <kessler@well.com>
To: Kate Boyd <boydkf@gwm.sc.edu>
cc: diglib@infoserv.inist.fr
Subject: Re: [DIGLIB] filenaming

Kate,

You ask a number of interesting and fundamental filenaming
questions:

> Is it still important to keep only 8 characters for the
> filename?  If not, should there be some uniform limit for all
> the collections, or not?

Yes it still is important, sadly, to stay with 8 characters.  A
number of Windows features go back to the old DOS 8-character
maximum, for example default field lengths can show only 8 in
some applications, and sort orders can become confused by longer
labels in others. Applications software, too, can run into this
in trying to interact with Windows: even where current Windows
versions have accommodated longer labels, the application may
have been built for earlier versions which didn't.

You also are dealing with the broader problem of legacy systems,
and legacy users: unless you want your data to be for the use,
forever and exclusively, only of in-house systems and personnel
-- and this won't happen, even if you want it -- you have to plan
on someone sometime, with an old computer, and/or old software,
and/or an old personal mentality and approach to all of this,
trying to use your data.

Say, for example, that your user is some emeritus professor who
has been collecting slides and files and other data since DOS or
even before. She then dumps it all into the same hopper, together
with some of your data -- tell me that they don't do this... --
and then tries to sort it. The inter-filed labels end up in all
sorts of weird sort-orders as a result: some algorithms will dump
the over-8-character labels all at the end in a jumble, others
will treat the extra characters as extensions and sort them,
others will not even recognize them, others will do other things.
The point is to adopt a "lowest common denominator" approach to
labels: the DOS 8-character maximum.

There are various workarounds for this. One would be a label of
two fields, or more: the first field holding the 8-character
lowest common denominator, the second and any others more
flexible and able to hold more data. All systems presumably could
do useful sorting on the first field, and some might be able to
sort on the others -- in effect, a database.
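
A minimal sketch, in Python, of the two-field idea. The record
layout, the names `Label` and `sort_by_key`, and every sample value
except the Nebraska one are made up for illustration; this is one
way to do it under those assumptions, not a recommended schema.

```python
from dataclasses import dataclass


@dataclass
class Label:
    key: str          # field #1: short, lowest-common-denominator sort key
    description: str  # field #2: free-form, may be much longer


# Hypothetical sample records; only the Nebraska description echoes the thread.
records = [
    Label("06439", "1989 08 Field study 05 Nebraska"),
    Label("06012", "1987 03 Field study 01 Kansas"),
    Label("05310", "1985 11 Lecture slides"),
]


def sort_by_key(items):
    """Sort on field #1 only, which any system should be able to do."""
    return sorted(items, key=lambda record: record.key)


for record in sort_by_key(records):
    print(record.key, record.description)
```

Because the keys are fixed-width and zero-padded, a plain character
sort and a numeric sort give the same order.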


> Is it important to use only lower case alphanumeric characters?
> Will other characters like hyphens cause issues in databases
> and servers?

All-lower-case alphanumeric is best, and avoiding all
non-alphanumeric -- such as hyphens -- also is best...

Just now I am working on an imaging project myself and Windows XP
is advising me,

        A file name cannot contain any of the following
        characters: \ / : * ? " < > |

Incidentally I see that Windows XP "Help", latest version and
updated, generally advises:

        Some programs cannot interpret long file names. The limit
        for programs that do not support long file names is eight
        characters. File names cannot contain the following
        characters: \ / : * ? " < > |

-- so do heed the suggestion above about the 8 character maximum,
as well...
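
A rough sketch, in Python, of a check that combines the advice so
far: at most 8 characters, lower-case alphanumeric only, and none
of the characters Windows XP lists. The function name and the
wording of the messages are assumptions for illustration; nothing
here is a Windows facility.

```python
import re

# The characters Windows XP refuses in file names, as quoted above.
WINDOWS_FORBIDDEN = set('\\/:*?"<>|')


def check_base_name(base):
    """Return a list of problems with a file's base name (no extension)."""
    problems = []
    if not base:
        problems.append("empty name")
    if len(base) > 8:
        problems.append("longer than 8 characters")
    bad = sorted(set(base) & WINDOWS_FORBIDDEN)
    if bad:
        problems.append("contains forbidden characters: " + " ".join(bad))
    if not re.fullmatch(r"[a-z0-9]*", base):
        problems.append("not all lower-case alphanumeric")
    return problems


for name in ["06439", "Field-study", "a:b"]:
    print(name, check_base_name(name) or "ok")
```

An empty list means the name passes all of the checks; anything
else says which rule it broke.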

I believe that you will run into the same sorts of limitation in
trying to burn a CDROM or DVD, too.

You need to check those for field-length limits, in addition,
even on your fields #2, #3, #4 etc.: one I now am using has a
"26-character" limit -- they never tell you whether this includes
spaces or not, and you have to check that -- another has a
"103-character" limit. My suggested workaround (above) will avoid
that so long as you have a useful but strictly limited tag in
your field #1.

And Internet browsers, too, can have difficulty with
non-alphanumeric characters sometimes: earlier and non-updated
versions both have strange "reserved" character systems -- and
early "ASCII" and "Microsoft ASCII" and "IBM ASCII" versions --
which choke on anything but the 26 letters of English plus the 10
digits of base 10. Again, it's the "legacy systems & users" problem: users
should have updated their systems but they didn't, and they
won't, but they'll still blame you...


> How important is it to relate the name of digital files to the
> actual object being scanned?  Is this better than arbitrary
> consecutive numbering to the files, so that there is overall
> uniformity with all of the digital collections.

You need both. Per the above comments about "field #1", that has
to be minimal lowest common denominator data, for sorting
purposes. At the same time users will be viewing documents
through thumbnail slideshows and "My Pictures" listings, where
minimal sort-order labels will be meaningless.

So a suggestion would be something like the following:

        06.439 1989 08 Field study 05 Nebraska

-- all one label

-- initial 6-character numeric for sort purposes, offering both an
"album" descriptor ("06") and a "page" descriptor ("439")

-- then date, with year ("1989") preceding month ("08") for
sort-order purposes -- hoping that your algorithm goes that far...

-- then alpha title and sort-order descriptor ("05") for that

-- and finally location, etc.

-- even if you carefully avoid duplicating that "439" (your unique
identifier), you are relying on the sort routines, in your own
_and_ your users' systems, either to sort on the entire field or
to truncate the sort at that first space after "439". But most
will... And you can pack a great number of access points /
descriptors into a single-field 38-character (including spaces)
label...

But you do need to design that label with your 8-character
alphanumeric maximum in mind: far better, if you can, to offer
simply,

        06.439<space>

-- or even better just --

        06439<space>

-- and do leave out the hyphens...
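
A small sketch, in Python, of how such a single-field label might be
handled: split it at the first space to get the sort key, then strip
the key down to alphanumerics for the short form. The function names
and the second sample label are hypothetical; real systems may
truncate or sort differently.

```python
def split_label(label):
    """Split a label at its first space into (sort key, descriptive rest)."""
    key, _, rest = label.partition(" ")
    return key, rest


def short_key(key):
    """Keep only letters and digits of the key, e.g. '06.439' -> '06439'."""
    return "".join(ch for ch in key if ch.isalnum())


labels = [
    "06.439 1989 08 Field study 05 Nebraska",  # the example label above
    "06.012 1987 03 Field study 01 Kansas",    # made up for illustration
]

# Sort on the leading key only, as a system that truncates at the space would.
for label in sorted(labels, key=lambda text: split_label(text)[0]):
    key, rest = split_label(label)
    print(short_key(key), "|", rest)
```

The short key is what would go into an 8-character file name; the
rest of the label can live wherever longer fields are allowed.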


> Any ideas or thoughts on filenaming for digital collections
> will be greatly appreciated.  Thank you for your time on this.

I'd despair of standardizing any of this, myself. It's too early.
In another 30 years, maybe, the development curve will have
settled down and the hardware & software & systems aspects will
have become standardized: then we'll know what the fieldlength
limits and relevance ranking parameters are, and standardization
of things like labels will be easier. The world will be a
less-interesting place, tho... It took Gutenberg's incunabula
world 50 years to get there...


Jack Kessler, kessler@well.com

ps. Do be sure to look at OAI / Open Archives Initiative --
http://www.openarchives.org/ -- for Dublin Core and other
relevant resource identifier projects. The immediate problem for
a local database, though, is marrying up to pc-based commercial
standards, and the "legacy" systems/software/users problem
mentioned above... both are going to get worse & more complicated
before they get better & simpler...


> Kate
>
> Kate Foster Boyd
> Digital Projects Librarian
> Thomas Cooper Library
> University of South Carolina
> 1322 Greene Street
> Columbia, SC 29208
> (803)-777-2249