I like to begin with a true story that might help put things in proper perspective when we talk about information technology and libraries. A friend of mine took a seminar class on the French Revolution at the University of Louisville in the 1970s. At the end of the semester, the professor asked each member of the class to make personal, concluding remarks on how the Revolution had affected the history of Western civilization. Everyone jumped in, postulated, espoused and embellished the significance of this event in the landscape of democracy. When it came to Mr. Wu, a Chinese student who had hardly talked throughout the semester, everyone was curious. He leaned back in his chair, crossed his arms, and after a dead silence, he finally said, "It is too soon to tell." This very typical, traditional Chinese long-view perspective, in my opinion, might keep us from being swept under and help us sort out where we have been, where we are, and where we are going in this mad, mad digital universe.
Have you ever wondered why Yahoo! hires catalogers to index and organize the elusive content of the Internet? Why would a computer scientist and an anthropologist proclaim that "in the information gathering business, the human touch and expertise are irreplaceable"?1 In that June 4, 1996 piece from the Christian Science Monitor titled "Put a Good Librarian, Not Software, in Driver's Seat," the authors talked about how librarians are more than technicians; how librarians can judge the reliability of sources; how librarians can read and weed out false drops; how librarians can act as guides to the information riches in cyberspace because they know their clients, they know database content and indexing conventions, and they know when to be discreet and when to be nosy. I don't think we could say it better. That's why I am not here to debate whether we have a role as organizers of information. I am here to ask you to respect the past and to help create the future by using what we know to find a Tao for the access and control of digital resources.
Our traditional information access and control models deal with content that is finite, packaged, bundled, and fixed in physical location. We painstakingly craft selection policies, hunt around for the best bargains from vendors or publishers, purchase items and fit them into our physical containers: the library stacks, microform cabinets, or CD-ROM boxes. Our access mechanisms may be vertical files for those flimsy pamphlets we don't want to catalog but that are vital nevertheless; finding aids for those manuscripts and archival materials that defy item-level cataloging but require a locator device; card catalogs for those items that we mainstream into the processing chain; and, last but not least, the commercial abstracting and indexing (A & I) services that help us with article-based access to the journal literature.
The mainstream processing chain has cataloging as its centerpiece because its output is the access tool, the catalog. We firmly believe that no matter how good the content is, it is of no value until someone makes use of it. So catalogers are trained to make certain that every conceivable access point to the piece in hand is provided in the bibliographic record. We use the Anglo-American Cataloguing Rules (AACR2) to describe the physical attributes of the item. We analyze the subject content of the item and assign subject headings from controlled vocabularies such as the Library of Congress Subject Headings (LCSH), Medical Subject Headings (MeSH), Sears, etc. And we survey the universe of knowledge to fit the item into its proper place using classification schemes such as the Dewey Decimal Classification, the Universal Decimal Classification, or the Library of Congress Classification. We also try to make sure that each name or subject term is unique and that its relationships to other names or subject terms in the catalog are expressed, so that our clients can get both precision and comprehensive recall in their searches. This is called "authority control," a process that is dear to our hearts but dreaded by library administrators because of its high cost.
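For those who have never built a catalog record, here is a minimal sketch, in Python and with invented data, of the kinds of access points and authority work I am describing; the headings, cross-reference and class number are illustrative only, not taken from any real record.

```python
# A hypothetical, simplified catalog record: every access point a patron might
# search under is spelled out explicitly by the cataloger.
record = {
    "main_entry": "Twain, Mark, 1835-1910",            # authority-controlled name heading
    "title": "Adventures of Huckleberry Finn",
    "subject_headings": [                              # controlled vocabulary, LCSH-style
        "Finn, Huckleberry (Fictitious character)--Fiction",
        "Mississippi River--Fiction",
    ],
    "classification": "PS1305",                        # places the item in the universe of knowledge
}

# Authority control: a cross-reference file maps variant forms of a name to the
# one authorized heading, so precision and recall are both preserved.
see_references = {
    "Clemens, Samuel Langhorne, 1835-1910": "Twain, Mark, 1835-1910",
}

def authorized_form(name: str) -> str:
    """Return the authorized heading for a name, following any see-reference."""
    return see_references.get(name, name)

print(authorized_form("Clemens, Samuel Langhorne, 1835-1910"))
```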
So the access model is "they come and they get". When they don't get using our manual access tools, libraries play the mediation role by way of reference or interlibrary loan services. Deficiencies for this model is the requirement of having to be physically there in the library to use the access tools and the content itself.
And then MARC formats and MARC-based online catalogs become the change agents for this model. The MARC formats give us content designators so we can fit descriptive and access data into tags, subfields and indicators. Nobody denies that MARC is card-based and that we are still coping with issues such as multiple versions and how to incorporate new attributes for networked electronic content. The newfangled access tool, the online public access catalog (OPAC), brings us flexible access points and remote users. It also offers an item status tracking mechanism from pre-order to shelf availability. In the meantime, the commercial A & I services start producing CD-ROMs and magnetic files to replace printed indexes and abstracts. We put them either in the reference department, where the print versions used to sit, or on campus networks, where they are accessed along with or via OPACs. This is when we start to link these A & I tools to our print holdings using mechanisms such as the ISSN (International Standard Serial Number), so our clients will know whether we own the journals they find citations for - a definite improvement over the print predecessors.
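To make the content designators concrete, here is a rough, hypothetical sketch of a MARC-style record and of the ISSN link between an A & I citation and our holdings. The tags and subfield codes follow USMARC conventions, but the record and the holdings table are invented.

```python
# An invented, highly simplified MARC-style record: (tag, indicators, subfields).
marc_record = [
    ("245", "00", {"a": "Journal of Library Automation."}),    # title statement
    ("022", "  ", {"a": "0022-2240"}),                         # ISSN
    ("650", " 0", {"a": "Libraries", "x": "Automation"}),      # LCSH-style subject heading
]

# Local serials holdings keyed by ISSN.
holdings_by_issn = {"0022-2240": "Main Library, Periodicals, 2nd floor"}

def link_citation_to_holdings(citation_issn: str) -> str:
    """Tell the client whether we own the journal an A & I citation points to."""
    return holdings_by_issn.get(citation_issn, "Not held locally; try interlibrary loan")

print(link_citation_to_holdings("0022-2240"))
```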
So the access model is "they log on, they come and they get". The control mechanisms are still the same as those for the manual system but the access is much more enhanced. Online access to both OPACs and A & I services offers our clients indexes to our content wherever they are. And yet they still have to come to get the content.
And here comes the Internet to turn this upside down. The library, the physical container for the pre-packaged content, begins to offer its OPAC on the Net. A & I services follow suit. Datafiles, programs, services, electronic journals, discussion lists and newsgroups start tumbling out of this interconnected global network onto millions of desktops all over the world. This new electronic content, unlike our CD-ROMs or magnetic files that have a physical presence on our campuses, is mutable, unbundled, and uncontainable. How can we use AACR2, which makes the title page the basis for description, to describe an interactive image file? The Human Genome Project? Or a Web page? Where in the MARC record should we indicate the image file format, whether it's JPEG or MPEG, and how to download and view it? Are we going to classify these? How can we link the bibliographic record to the real thing? Thanks to the OCLC Internet Cataloging Project, most of these questions have been or are being answered. In trying out the traditional cataloging tools on Internet resources, this project not only gives us field experience but also gives us the 856 field as a linking mechanism between the record and the digital object. One fundamental question still stands, though, as the project draws to a close: why do we want to provide OPAC access to Internet content when Lycos, AltaVista, WebCrawler and Yahoo! are already doing that?
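As a hedged illustration of that linking mechanism, the sketch below shows what an 856 field might carry; the URL and notes are invented, and the subfield usage shown ($u for the location, $q for the file format) is my reading of the field's conventions.

```python
# An invented 856 (Electronic Location and Access) field: subfield u carries the
# URL and subfield q a format note -- the link the OCLC project exercised.
field_856 = {
    "tag": "856",
    "indicators": "4 ",          # access method: HTTP (per my reading of the conventions)
    "subfields": {
        "u": "http://www.example.edu/images/sample.jpg",     # hypothetical URL
        "q": "JPEG",                                          # file format note
        "z": "Use a graphical Web browser to view this image.",
    },
}

def hyperlink_from_record(field: dict) -> str:
    """Pull the clickable link an OPAC would present to the user."""
    return field["subfields"]["u"]

print(hyperlink_from_record(field_856))
```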
Well, for those of you surfing the Net looking for specific information, it's no exaggeration to say that it's like looking for needles in a haystack. As Peter Morville so aptly stated in his May 17, 1996 Web Review article titled "Revenge of the Librarians": "...to someone with a hammer in his hand, all problems look like nails. Well, on the Internet, computer science folks tend to hit information retrieval problems over the head with relevance ranking algorithms, intelligent agents, and lightning quick processors."2 We cybrarians, on the other hand, would like to hit information retrieval problems with controlled vocabularies, authority control, and classification schemes. With Internet resources doubling every 6 to 8 months, can we afford to sit in the library and tell our clients to surf at their own risk?
So now the access model is "they log on, they search or surf, they get or get lost, they come and they get." When they search our OPACs and A & I databases via the Internet, they rely on our and our vendors' consistent, pre-coordinated and syndetic structures of controlled vocabularies. The major departure from the previous model is that in some cases they don't even need to come for the content. Full-text databases and patron-initiated interlibrary loan are the new enabling agents that link our clients to the content. As libraries struggle with the access-versus-ownership issue for their collections, further integration of OPACs with A & I and full-text databases will drive the designs of MARC-based library systems.
So on one track we have Internet-accessible OPACs, A & I and some full-text databases; on the other track we have the Internet search engines. This dual-track access model for the two parallel universes defines a critical moment for us as organizers of information. How can we integrate the two and provide our clients seamless access to both print and electronic content?
As we witness the explosive growth and sweeping acceptance of the World Wide Web as the de facto standard for Internet resources, Web-based OPACs and A & I and full-text services have entered the market during this past year to provide the much-needed unified user interface for dual-track access. So now our clients can use a Web browser, click on the OPAC or the A & I and full-text services, or click on an Internet search tool, to open up the print or digital universe. This last-mile solution signifies a turning point where the two tracks do meet and have the same look and feel. But are we there yet? Not yet. But we are getting there, slowly but steadily. I would like to talk about some of the emerging movements that will help us get there in building a Tao for the access and control of digital resources.
Let's look at the major producers of digital resources: publishers, authors, and libraries. Of these, libraries are a relative newcomer. Digitizing primary source materials has become a new focus for libraries because it serves dual purposes: it limits physical use of the item and enhances its access. While the selection of target collections is of primary importance, the adoption of a textual markup language will prove to be critical for access and control. The slow and steady evolution of the Standard Generalized Markup Language (SGML) into a national and international standard during the past 26 years paves the way for a unified, global approach to the access and control of electronic texts. As a metalanguage, a language that describes other markup languages, SGML sets up a framework for descriptions of various types of documents. These descriptions are known as Document Type Definitions (DTDs). This reminds me of the international bibliographic information interchange standard, ISO 2709 or Z39.2: it only specifies the general structure of a MARC record, and USMARC, UKMARC, CHINAMARC, etc., all spring from it. The use of SGML DTDs for electronic texts ensures portability independent of applications, and subsequent reuse, just as the MARC formats do for bibliographic records. One of the best-known SGML DTDs is the Hypertext Markup Language (HTML), the markup used by the Web. It allows users to embed images, sounds, and videos and to provide hyperlinks among documents and objects. However, "the markup is essentially very simple and only defines a document's structure at a very basic level."3 The popularity of this application has generated many specifications and revisions to this DTD, with various browsers trying to interpret them and vying for market share. Let's hope the W3 Consortium can contain and coordinate these developments.
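To illustrate the metalanguage idea, here is a toy DTD fragment, written for this talk and not taken from any standard, held in a Python string; it declares the structure of a trivial document type in the same way that HTML's DTD declares the structure of a Web page.

```python
# A toy SGML Document Type Definition (invented for illustration). SGML itself says
# nothing about pamphlets; it only supplies the grammar in which this declaration is
# written, just as Z39.2 supplies MARC with its record structure.
PAMPHLET_DTD = """
<!ELEMENT pamphlet - - (title, author?, body) >
<!ELEMENT title    - - (#PCDATA) >
<!ELEMENT author   - - (#PCDATA) >
<!ELEMENT body     - - (#PCDATA) >
"""

# A document instance conforming to that invented DTD.
sample_instance = """
<pamphlet>
  <title>Using the Card Catalog</title>
  <author>Reference Department</author>
  <body>Start with the author, title, or subject drawer...</body>
</pamphlet>
"""

print(sample_instance)
```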
Web-based OPACs do use HTML, but only partially. What I mean is that a Web browser connects to Web gateway software that searches the MARC records contained in the OPAC, creates HTML on the fly, and returns it to the client. HTML serves as the user-interface patch to the MARC-based OPACs. So we are still tied to the MARC record, with its limitations in accommodating complex hierarchical relationships and evaluative, analytical information. These limitations loom large for digitized texts, because users need this type of information to make use of the e-texts. Then, what is the solution? The following projects might give us some hints, for they have forged ahead to test the use of SGML as the structural base for the access and control of digital resources:
1. The Text Encoding Initiative
TEI is one of the most ambitious and far-reaching initiatives to use SGML for the generation and exchange of e-texts for scholarly research on an international level. Its Guidelines for Electronic Text Encoding and Interchange provide "a common encoding scheme for complex textual structures in order to reduce the diversity of existing encoding practices, simplify processing by machine, and encourage the sharing of electronic texts."4 The required TEI header contains metadata, that is, data about data, on the e-text that follows. TEI deliberately rejected using MARC to accommodate the header information because most of this data would have to be squeezed into the 500 fields and would thus lose the structure needed for access and use. The header includes a file description, an encoding description, a text profile description, and a revision history description. Of these 4 elements, the file description most closely resembles the cataloging record: it contains the title, author, creator and publisher of the electronic version, the file size, and the print source. This is the element some e-text centers have already tapped to create MARC records while encoding TEI headers. The University of Virginia's Electronic Text Center, for example, could serve as a good model for libraries still deliberating whether to take on such a task. Its director, David Seaman, explained its guiding principles this way: "Central to our selection criteria is the desire for software and platform-independent texts - if it's not SGML, it's ephemeral - and central to our cataloging endeavors is an SGML bibliographic record such as the Text Encoding Initiative header."5 One important issue in creating MARC from TEI headers at Virginia is the adoption of authority control for the headings contained in TEI headers; there are no guidelines for that in TEI. In addition, they wrote a PERL script to convert SGML to HTML on the fly, so that it "allows us (them) to have Web access without reducing the markup of our (their) data to the rather impoverished level of HTML."6
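Virginia's conversion is a PERL script I have not seen; purely as a sketch of the idea, here is a Python rendering of turning a drastically simplified, invented TEI-style header into HTML on the fly.

```python
import re

# A drastically simplified, invented TEI-style header. The real header carries a
# file description, encoding description, text profile, and revision history.
tei_header = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>Adventures of Huckleberry Finn: an electronic edition</title></titleStmt>
    <publicationStmt><publisher>Hypothetical E-Text Center</publisher></publicationStmt>
    <sourceDesc><bibl>Print source: New York, 1885.</bibl></sourceDesc>
  </fileDesc>
</teiHeader>
"""

def header_to_html(header: str) -> str:
    """Crude on-the-fly conversion: surface the title and publisher as HTML."""
    title = re.search(r"<title>(.*?)</title>", header, re.S)
    publisher = re.search(r"<publisher>(.*?)</publisher>", header, re.S)
    return ("<html><body><h1>{}</h1><p>Published by {}</p></body></html>"
            .format(title.group(1), publisher.group(1)))

print(header_to_html(tei_header))
```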
2. Berkeley Finding Aid Project
BFAP aims to create a prototype encoding scheme and build a prototype database for archive, museum and library finding aids. As an intermediate access tool between a collection-level MARC record and the primary source materials, finding aids serve as essential metadata for collections of related materials. Again, SGML was chosen over MARC for its potential to accommodate the hierarchical and interrelated levels among collections, series, units, and items. Daniel Pitti and his team produced encoding standard design principles and laid the groundwork for developing a new SGML DTD named Encoded Archival Description, or EAD. The Library of Congress Network Development and MARC Standards Office would be the maintenance agency for EAD once it is endorsed as a standard by the archival community through the Society of American Archivists. This development signifies a profession-wide recognition that there is no better way to provide access and control of digital finding aids than to establish a standard and collaboratively build a database of standard-conformant records.
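As a toy illustration of the hierarchy a finding aid must express, and a flat MARC record cannot, here is an invented collection with its series and items sketched in Python; EAD would capture the same nesting in SGML markup.

```python
# An invented finding aid, reduced to its hierarchy: the nesting of collection,
# series, and items is exactly what a flat bibliographic record cannot express
# and what an SGML DTD such as EAD is designed to capture.
finding_aid = {
    "collection": "Papers of a Hypothetical Kentucky Author, 1890-1950",
    "series": [
        {
            "title": "Series I: Correspondence",
            "items": ["Letters to publishers, 1902-1910", "Family letters, 1920-1935"],
        },
        {
            "title": "Series II: Manuscripts",
            "items": ["Draft of first novel, ca. 1905"],
        },
    ],
}

def walk(aid: dict) -> None:
    """Print the hierarchy an archivist or researcher would browse, level by level."""
    print(aid["collection"])
    for series in aid["series"]:
        print("  " + series["title"])
        for item in series["items"]:
            print("    " + item)

walk(finding_aid)
```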
3. Columbia University Digital Image Access Project
DIAP is a Research Libraries Group initiative on access to digital images, and Columbia University has set out to develop a new model to contain both bibliographic and analytical data on digital images. "The proposed data model could readily be "housed" in an SGML-encoded bibliographic (metadata) record that encapsulates both summary bibliographic information along with detailed hierarchical and version-related data, when such data is appropriate and considered useful to record. The record would also include links to the actual digital items, to other related bibliographic records or, in fact, to diverse, related digital objects (such as external electronic publications, databases, numeric files, etc.) The working designation SGML Catalog Record (SCR) is proposed for this new type of record."7 Again, the flat MARC record structure is bypassed for the richer, more flexible SGML encoding scheme. Yet, in order to operate in a MARC-based world, Columbia proposed to link MARC records to SCRs so the enriched SGML-encoded data would not be lost to users.
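Reading Columbia's description, one can imagine the SCR as roughly the following kind of structure; this is my own invented sketch, not Columbia's actual model, but it shows summary bibliographic data, version-related data, and links living in one record.

```python
# An invented approximation of the proposed SGML Catalog Record (SCR): summary
# bibliographic data plus version-related data and links to the digital items
# and to the related MARC record.
scr = {
    "summary": {"title": "Architectural drawings of a hypothetical campus building"},
    "versions": [
        {"label": "Archival master", "format": "TIFF",
         "url": "http://www.example.edu/diap/001-master.tif"},
        {"label": "Screen view", "format": "JPEG",
         "url": "http://www.example.edu/diap/001-screen.jpg"},
    ],
    "related_marc_record": "bib-0012345",   # hypothetical OPAC record number
}

def urls_for_display(record: dict) -> list:
    """Collect the links a user interface would offer for this image."""
    return [version["url"] for version in record["versions"]]

print(urls_for_display(scr))
```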
All three of these projects are compelling models for libraries to rethink the role of MARC-based OPACs in the access and control of digital resources. If we, as creators and organizers of these digitized resources, cannot integrate their access into our OPACs, then we will have to create a new infrastructure to accommodate them. After all, these resources are under our control - we either select and purchase them or we digitize them. The recent news that LC's Network Development and MARC Standards Office, along with the USMARC standards body, MARBI, is taking on the task of developing a USMARC DTD is very encouraging. The prospect of building a shared database of SGML-encoded records and providing an online library system for hyperlinked access to the bibliographic, analytical and evaluative information on digital resources does not seem too far-fetched to me. I know. It is too soon to tell!
Now I would like to turn our attention to those Internet resources that are not under our control. "The most important point to remember is that," as Peter Graham, one of the most clear-headed among us, stated, "from the users' point of view, the catalog is only one of a repertoire of tools."8 We know that cataloging the entire Internet is not humanly possible, and getting the producers of Internet tools to utilize our traditional control mechanisms is probably next to impossible. However, there are some Internet search tools that use classification schemes: CyberDewey uses Dewey, CyberStacks uses Library of Congress Classification, and BUBL uses UDC. On the other hand, Yahoo's catalogers decided to use their own subject system after rejecting Dewey, LC and LCSH. Now they are finding out that the subject trees are getting too heavy and are in danger of total collapse. Simply put, what we are all searching for are standards for creating record surrogates for item description, access and use: the kind of metadata standards that could be adopted and used globally for the digital universe, whether the information is bibliographic, scientific, technical or geospatial. The TEI header is a good example of such a standard for e-texts, as the Federal Geographic Data Committee (FGDC) standard is for geospatial information. Yet, in order to enable creators and publishers from all user communities to embed metadata in their digital resources, the data elements would have to be transparent, generic, and easy to supply. Thus was born the Dublin Core Element Set.
The OCLC/NCSA Metadata Workshop, held in Dublin, Ohio, in March of 1995, intended to reach some kind of consensus on a limited data element set for the discovery and retrieval of Internet resources. Participants came from various stakeholder groups: researchers, software developers, computer and information scientists, publishers, librarians, archivists, and members of the Internet Engineering Task Force (IETF) Uniform Resource Identifiers (URI) Working Group. Can you imagine putting all these people in one room for 2 days and expecting them to come to an agreement on a common set? Well, they did it. This common set of elements would have 4 uses: "...It would encourage authors and publishers to provide metadata along with their data. It would allow developers...to include templates for this information directly in their software. Metadata...could serve as a basis for more detailed cataloging or description when warranted by specific communities...the standard would ensure that a common core set of elements could be understood across communities..."9 The resultant set includes 13 data elements. Yet simplicity brings forth ambiguity. This core is of limited usefulness unless it can be further defined and formally linked to more descriptive records containing details specific to a user community. For the library community, these data elements can for the most part be mapped to USMARC, although guidelines would need to be instituted to ensure portability. This would mean that we could extend and enhance such metadata in digital objects and integrate them into OPACs.
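As a hedged illustration of what such a mapping might look like, here is a partial crosswalk sketched in Python; the field choices are my rough guesses at the obvious correspondences, not the published Dublin Core-to-USMARC mapping.

```python
# A partial, illustrative crosswalk from some Dublin Core elements to USMARC
# fields -- rough guesses at obvious correspondences, not an official mapping.
dublin_core_to_usmarc = {
    "Title":      "245 $a",   # title statement
    "Author":     "720 $a",   # uncontrolled name added entry
    "Subject":    "653 $a",   # uncontrolled index term
    "Publisher":  "260 $b",
    "Date":       "260 $c",
    "Language":   "041 $a",
    "Identifier": "856 $u",   # e.g., the resource's URL
}

def to_usmarc(dc_record: dict) -> list:
    """Turn a dictionary of Dublin Core elements into (field, value) pairs."""
    return [(dublin_core_to_usmarc[name], value)
            for name, value in dc_record.items()
            if name in dublin_core_to_usmarc]

print(to_usmarc({"Title": "A Hypothetical Home Page", "Language": "eng"}))
```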
This calls for testbeds from all stakeholders to bridge the Dublin Core element set to their own agreed-upon element sets for a possible interoperable mapping scheme. I understand that OCLC's SPECTRUM system is testing whether a mapping is possible from user input of Dublin Core data elements in HTML to TEI headers and MARC records. Let's hope that more of this is to come.
When we talk about metadata, we will need to look at the alphabet soup: Uniform Resource Locator (URL), Uniform Resource Name (URN), and Uniform Resource Characteristic (URC). Of these 3, the URL is the most mature. The URN is intended to resolve the problems with the URL's instability. It is to be unique, persistent, and location-independent so it can serve as a link between the URL and the object. Several naming conventions are under consideration, OCLC's Persistent Uniform Resource Locator (PURL) and CNRI's Handle server being among them. The result could be a combination of both. The URC is a type of metadata intended to ensure machine retrieval and authentication of resources, and it has access restriction capability. There are currently 10 data elements that, if consistently applied, would guarantee discovery and retrieval of digital objects. The URC is to serve as a link for both URL and URN, so that a user could access the URC to retrieve the URN and then the URL to get the object itself.
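A toy sketch of that linkage, with an invented URC record, might look like this; real resolution services such as PURL or the Handle server are, of course, networked systems rather than a lookup in a local table.

```python
# An invented URC (metadata record) illustrating the linkage described above: it
# carries the persistent name (URN) and the current location(s) (URLs), so a
# client can go from description to name to the object itself.
urc = {
    "urn": "urn:example:doc:42",            # hypothetical persistent, location-independent name
    "title": "A Hypothetical Electronic Text",
    "access_restrictions": "none",
    "urls": ["http://mirror1.example.edu/doc42",
             "http://mirror2.example.org/doc42"],
}

def locate(record: dict) -> str:
    """From the URC, read the URN for citation and a current URL for retrieval."""
    print("Persistent name:", record["urn"])
    return record["urls"][0]                # the URL may change; the URN should not

print(locate(urc))
```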
I recently learned from Stu Weibel of OCLC at the ALA conference in New York that the W3 Consortium has actually reached consensus on a syntax for embedding metadata in digital objects and on a draft for link tags (metatags) to serve as a bridge among different schemes. Will the day when we see a robust metadata framework for Internet resources be far behind? Once the metadata schemes are properly mapped and linked, structural search tools could then be constructed to facilitate browsing. But are we there yet? Not quite. We still have the character set issue to resolve, and we hope the implementation of 16-bit Unicode might alleviate most of the problems we now encounter in a multilingual environment. And we still have the thesauri mapping issue that just wouldn't go away. If only we had a common vocabulary to serve as the bridge, we would be there!
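I have not seen the Consortium's draft syntax; purely as an illustrative sketch, embedding a few Dublin Core style elements in an HTML page with META tags, plus a link tag pointing to a fuller external record, might look something like this (all names and URLs invented).

```python
import re

# A purely illustrative HTML fragment (held in a Python string) showing how Dublin
# Core style metadata might be embedded with META tags, and a LINK tag pointing to
# a richer external record; this is not the W3 Consortium's actual syntax.
embedded_metadata = """
<html>
<head>
  <title>A Hypothetical Home Page</title>
  <meta name="DC.Title"  content="A Hypothetical Home Page">
  <meta name="DC.Author" content="Hypothetical Author">
  <meta name="DC.Date"   content="1996">
  <link rel="meta" href="http://www.example.edu/records/tei-header-42.sgml">
</head>
<body>...</body>
</html>
"""

def harvest(html: str) -> dict:
    """A crude harvester of the sort an indexing robot or a cataloger's tool might run."""
    pairs = re.findall(r'<meta name="(DC\.\w+)"\s+content="([^"]*)"', html)
    return dict(pairs)

print(harvest(embedded_metadata))
```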
So, you might ask, as organizers of information, what is our role at this critical juncture? For one thing, we can do a better job of selecting Internet resources and integrating them into our OPACs. Selection policies allow us to sift through the noise and bring objects of value into our digital content for our clients. We can then utilize our traditional access and control mechanisms to provide the same kind of access as that for the print content. The Library of Congress recently announced a pilot project, running June through December of this year, to select and catalog Internet resources in the area of business and economics. BeOnline will provide a testbed to address issues such as the tools needed to integrate networked information, the criteria needed to determine which resources warrant cataloging or homepage listing, the adequacy of traditional tools for description, URL resolution, and ways to reduce the cost of cataloging such resources. I am glad that LC is taking a leadership role in this venture.
Next, we need to work with document creators or, better still, to be creators ourselves, to ensure that digital resources are SGML-encoded with embedded metadata that can be incorporated into our MARC-based OPACs. We need to get involved with the development and testing of SGML DTDs for the value-added content under our stewardship. We need to be active participants in the deliberations on metadata schemes such as URC. We need to think globally and act locally to bring about the integration of our traditional access and control mechanisms into the new meta-access system. And, finally, we need to work with library system designers on the requirements for an SGML-based online system.
REFERENCES:
1. "Put a Good Librarian, Not Software, in Driver's Seat," Christian Science Monitor, June 4, 1996.
2. Peter Morville, "Revenge of the Librarians," Web Review, May 17, 1996. http://gnn.com/gnn/wr/96/05/17/webarch/index.html
3. Edward Gaynor, "From MARC to Markup: SGML and Online Library Systems," From Catalog to Gateway, Briefings from the CFFC, no. 7, ALCTS Newsletter, vol. 7, no. 2, 1996.
4. C. M. Sperberg-McQueen and Lou Burnard, Guidelines for Electronic Text Encoding and Interchange. Chicago: Text Encoding Initiative, 1994. (http://etext.virginia.edu/TEI.html)
5. David M. Seaman, "Selection, Access, and Control in a Library of Electronic Texts," Cataloging & Classification Quarterly, vol. 22, no. 3/4, 1996.
6. Ibid.
7. Steven Paul Davis, "Digital Image Collections: Cataloging Data Model & Network Access." http://www.cc.columbia.edu/cu/libraries/inside/projects/diap/paper.html
8. Peter S. Graham, "The Mid-Decade Catalog," From Catalog to Gateway, Briefings from the CFFC, no. 1, ALCTS Newsletter, vol. 5, 1994.
9. Priscilla Caplan and Rebecca Guenther, "Metadata for Internet Resources: The Dublin Core Metadata Elements Set and Its Mapping to USMARC," Cataloging & Classification Quarterly, vol. 22, no. 3/4, 1996.