The First China-United States Library Conference

IS THERE A TAO FOR THE ACCESS AND CONTROL OF DIGITAL RESOURCES?

BY LING-YUH W. (MIKO) PATTIE

I would like to begin with a true story that might help put things in proper perspective when we talk about information technology and libraries. A friend of mine took a seminar on the French Revolution at the University of Louisville in the 1970s. At the end of the semester, the professor asked each member of the class to offer a personal, concluding assessment of how the Revolution had shaped the history of Western civilization. Everyone jumped in, postulated, espoused and embellished the significance of this event in the landscape of democracy. When it came to Mr. Wu, a Chinese student who had hardly talked throughout the semester, everyone was curious. He leaned back in his chair, crossed his arms, and after a dead silence he finally said, "It is too soon to tell." This very typical, traditional Chinese long view, in my opinion, might keep us from being swept under and help us sort out where we have been, where we are, and where we are going in this mad, mad digital universe.

Have you ever wondered why Yahoo! hires catalogers to index and organize the elusive content of the Internet? Why would a computer scientist and an anthropologist proclaim that "in the information gathering business, the human touch and expertise are irreplaceable"?1 In that June 4, 1996 piece from the Christian Science Monitor titled "Put a Good Librarian, Not Software, in Driver's Seat," the authors talked about how librarians are more than technicians; how librarians can judge the reliability of sources; how librarians can read and weed out false drops; how librarians can act as guides to the information riches in cyberspace because they know their clients, they know the database content and indexing conventions, and they know when to be discreet and when to be nosy. I don't think we could say it better. That is why I am not here to debate whether we have a role as organizers of information. I am here to ask you to respect the past and to help create the future by using what we know to find a Tao for the access and control of digital resources.

Our traditional information access and control models deal with content that is finite, packaged, bundled, and fixed in a physical location. We painstakingly craft selection policies, hunt around for the best bargains from vendors or publishers, purchase items, and fit them into our physical containers: the library stacks, microform cabinets, or CD-ROM boxes. Our access mechanisms include vertical files for those flimsy pamphlets we don't want to catalog but that are vital nevertheless; finding aids for those manuscripts and archival materials that defy item-level cataloging but require a locator device; card catalogs for those items that we mainstream into the processing chain; and, last but not least, the commercial abstracting and indexing (A & I) services that give us article-based access to the journal literature.

The mainstream processing chain has cataloging as its centerpiece, because its output is the access tool, the catalog. We firmly believe that no matter how good the content is, it is of no value until someone makes use of it. So catalogers are trained to make certain that every conceivable access point to the piece in hand is provided in the bibliographic record. We use the Anglo-American Cataloguing Rules (AACR2) to describe the physical attributes of the item. We analyze the subject content of the item and assign subject headings from controlled vocabularies such as the Library of Congress Subject Headings (LCSH), Medical Subject Headings (MeSH), Sears, and so on. And we survey the universe of knowledge to fit the item into its proper place using classification schemes such as the Dewey Decimal Classification, the Universal Decimal Classification, or the Library of Congress Classification. We also try to make sure that each name or subject term is unique and that its relationships to other names or subject terms in the catalog are expressed, so that our clients can achieve both precision and comprehensiveness in their searches. This is called "authority control," a process that is dear to our hearts but dreaded by library administrators because of its high cost.
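To make this concrete, here is a minimal sketch in Python of the access points a cataloger assembles for a single item and of the authority check behind them. The record, the authority file, and the function are illustrative assumptions, not any particular library's actual system.

```python
# A minimal sketch (not any library's actual system) of the access points a
# cataloger assembles for one item: AACR2-style description, controlled
# subject headings, a classification number, and an authority check that
# keeps each heading unique and consistent across the catalog.

# Hypothetical authority file: the established form of each name or subject.
AUTHORITY_FILE = {
    "Twain, Mark, 1835-1910": {"see_from": ["Clemens, Samuel Langhorne"]},
    "Mississippi River": {"broader": ["Rivers--United States"]},
}

record = {
    "description": {                      # physical attributes, AACR2-style
        "title": "Life on the Mississippi",
        "author": "Twain, Mark, 1835-1910",
        "publisher": "James R. Osgood",
        "date": "1883",
    },
    "subject_headings": ["Mississippi River"],   # from a controlled vocabulary such as LCSH
    "classification": "F353",                    # e.g., an LC Classification number
}

def authority_check(record, authority_file):
    """Flag any heading that is not the established (authorized) form."""
    problems = []
    headings = [record["description"]["author"]] + record["subject_headings"]
    for heading in headings:
        if heading not in authority_file:
            problems.append(heading)
    return problems

print(authority_check(record, AUTHORITY_FILE))   # [] means every heading is authorized
```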

So the access model is "they come and they get". When they don't get using our manual access tools, libraries play the mediation role by way of reference or interlibrary loan services. Deficiencies for this model is the requirement of having to be physically there in the library to use the access tools and the content itself.

And then MARC formats and MARC-based online catalogs become the change agents for this model. MARC formats give us content designators, so we can fit descriptive and access data into tags, subfields, and indicators. Nobody denies that MARC is card-based and that we are still coping with issues such as multiple versions and how to incorporate new attributes for networked electronic content. The new-fangled access tool, the online public access catalog (OPAC), brings us flexible access points and remote users. It also offers an item status tracking mechanism from pre-order to shelf availability. In the meantime, the commercial A & I services start producing CD-ROMs and magnetic files to replace printed indexes and abstracts. We put them either in the reference department, where the print versions used to sit, or on campus networks, where they are accessed along with or via OPACs. This is when we start to link these A & I tools to our print holdings using mechanisms such as the ISSN (International Standard Serial Number), so our clients will know whether we own the journals they find citations for - a definite improvement over their print predecessors.
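The sketch below, again illustrative rather than any vendor's actual system, shows the shape of a MARC field (tag, indicators, subfields) and the kind of ISSN-based link from an A & I citation to local print holdings described above. The holdings table and location string are made up.

```python
# A minimal sketch of MARC content designators: each field has a three-digit
# tag, two indicators, and subfields. The last function mimics an ISSN-based
# link from an A & I citation to local print holdings.

marc_record = [
    {"tag": "022", "ind": "  ", "subfields": {"a": "0028-0836"}},        # ISSN
    {"tag": "245", "ind": "00", "subfields": {"a": "Nature."}},          # title
    {"tag": "260", "ind": "  ", "subfields": {"b": "Macmillan"}},        # publisher
]

local_holdings = {"0028-0836": "Periodicals, 2nd floor"}   # hypothetical ISSN -> location

def issn_of(record):
    """Pull the ISSN out of field 022, subfield $a."""
    for field in record:
        if field["tag"] == "022":
            return field["subfields"].get("a")
    return None

def locate_citation(citation_issn):
    """Tell the client whether we hold the journal a citation points to."""
    location = local_holdings.get(citation_issn)
    return location or "Not held locally - request via interlibrary loan"

print(locate_citation(issn_of(marc_record)))    # Periodicals, 2nd floor
```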

So the access model is "they log on, they come and they get". The control mechanisms are still the same as those for the manual system but the access is much more enhanced. Online access to both OPACs and A & I services offers our clients indexes to our content wherever they are. And yet they still have to come to get the content.

And here comes the Internet to turn all of this upside down. The library, the physical container for pre-packaged content, begins to offer its OPAC on the Net. A & I services follow suit. Datafiles, programs, services, electronic journals, discussion lists, and newsgroups start tumbling out of this interconnected global network onto millions of desktops all over the world. This new electronic content, unlike our CD-ROMs or magnetic files that have a physical presence on our campuses, is mutable, unbundled, and uncontainable. How can we use AACR2, which takes the title page as the basis for description, to describe an interactive image file? The Human Genome Project? Or a Web page? Where in the MARC record should we indicate the image file format, whether it is JPEG or MPEG, and how to download and view it? Are we going to classify these? How can we link the bibliographic record to the real thing? Thanks to the OCLC Internet Cataloging Project, most of these questions have been or are being answered. In trying out the traditional cataloging tools on Internet resources, this project not only gives us field experience but also gives us the 856 field as a linking mechanism between the record and the digital object. One fundamental question still stands, though, as the project draws to a close: why do we want to provide OPAC access to Internet content when Lycos, Alta Vista, WebCrawler, and Yahoo! are already doing that?
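A minimal sketch of that 856 linking mechanism follows; the record and URL are hypothetical, but subfield $u is the standard place for the location of the digital object.

```python
# A minimal sketch of the 856 field as a linking mechanism: subfield $u
# carries the location of the digital object, so an OPAC can turn the
# bibliographic record into a live pointer. The record below is illustrative.

record = {
    "245": {"a": "Human Genome Project information [electronic resource]"},
    "856": {"u": "http://www.example.org/genome", "q": "HTML"},   # $u = URI, $q = file format
}

def link_from_record(rec):
    """Return the URL an OPAC display would offer as a hyperlink."""
    field_856 = rec.get("856")
    return field_856["u"] if field_856 else None

print(link_from_record(record))   # http://www.example.org/genome
```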

Well, for those of you surfing the Net looking for specific information, it is no exaggeration to say that it is like looking for needles in a haystack. As Peter Morville, in his May 17, 1996 Web Review article titled "Revenge of the Librarians," so aptly stated: "...to someone with a hammer in his hand, all problems look like nails. Well, on the Internet, computer science folks tend to hit information retrieval problems over the head with relevance ranking algorithms, intelligent agents, and lightning quick processors."2 We cybrarians, on the other hand, would like to hit information retrieval problems with controlled vocabularies, authority control, and classification schemes. With Internet resources doubling every six to eight months, can we afford to sit in the library and tell our clients to surf at their own risk?

So now the access model is "they log on, they search or surf, they get or get lost, they come and they get." When they search our OPACs and A & I databases via the Internet, they rely on our and our vendors' consistent, pre-coordinated, syndetic structure of controlled vocabularies. The major departure from the previous model is that in some cases they do not even need to come for the content. Full-text databases and patron-initiated interlibrary loan are the new enabling agents linking our clients to the content. As libraries struggle with the access-versus-ownership issue for their collections, further integration of OPACs with A & I and full-text databases will drive the designs of MARC-based library systems.

So on one track we have Internet-accessible OPACs, A & I, and some full-text databases, and on the other track we have the Internet search engines. This dual-track access model for two parallel universes defines a critical moment for us as organizers of information. How can we integrate the two and provide our clients seamless access to both print and electronic content?

As we witness the explosive growth and sweeping acceptance of the World Wide Web as the de facto standard for Internet resources, Web-based OPACs and A & I and full-text services have entered the market during this past year to provide the much-needed unified user interface for this dual-track access. So now our clients can use a Web browser and click on the OPAC, on the A & I and full-text services, or on an Internet search tool to open up the print or digital universe. This last-mile solution marks a turning point where the two tracks do meet and have the same look and feel. But are we there yet? Not yet. We are getting there, though, slowly but steadily. I would like to talk about some of the emerging movements that will help us get there in building a Tao for the access and control of digital resources.

Let's look at the major producers of digital resources: publishers, authors, and libraries. Of these, libraries are a relative newcomer. Digitizing primary source materials has become a new focus for libraries because it serves dual purposes: it limits physical use of the item and it enhances access. While the selection of target collections is of primary importance, the adoption of a textual markup language will prove critical for access and control. The slow and steady evolution of the Standard Generalized Markup Language (SGML) into a national and international standard over the past 26 years paves the way for a unified, global approach to the access and control of electronic texts. As a metalanguage, a language that describes other markup languages, SGML sets up a framework for describing various types of documents. These descriptions are known as Document Type Definitions (DTDs). This reminds me of the international bibliographic information interchange standard, ISO 2709 or Z39.2: it only specifies the general structure of a MARC record, and USMARC, UKMARC, CHINAMARC, and the rest all spring forth from that. The use of SGML DTDs for electronic texts ensures portability independent of applications and allows subsequent reuse, just as the MARC formats do for bibliographic records. One of the best-known SGML DTDs is the Hypertext Markup Language (HTML), the markup used by the Web. It allows users to embed images, sounds, and video and to provide hyperlinks among documents and objects. However, "the markup is essentially very simple and only defines a document's structure at a very basic level."3 The popularity of this application has generated many specifications and revisions to this DTD, with various browsers trying to interpret them and vying for market share. Let's hope the W3 Consortium can contain and coordinate these developments.
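The toy example below is meant only to convey the metalanguage idea: a DTD acts as a grammar for one type of document, and SGML is the framework in which such grammars are written. The element names and the little validator are invented for illustration; real DTDs such as HTML or the TEI DTD are vastly richer.

```python
# A minimal sketch of the metalanguage idea: a DTD is a grammar for one
# document type. Here the "DTD" is just a table of which child elements each
# element may contain, and a document is a nested tuple of (name, children).

TOY_DTD = {                      # hypothetical document type, not a real DTD
    "letter": ["heading", "body"],
    "heading": [],
    "body": ["paragraph"],
    "paragraph": [],
}

document = ("letter", [("heading", []), ("body", [("paragraph", [])])])

def validates(node, dtd):
    """Check that every element contains only children its DTD allows."""
    name, children = node
    allowed = dtd.get(name)
    if allowed is None:
        return False
    return all(child[0] in allowed and validates(child, dtd) for child in children)

print(validates(document, TOY_DTD))   # True: the document conforms to its DTD
```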

Web-based OPACs do use HTML, but only partially. What I mean is that a Web browser connects to a Web gateway that searches the MARC records contained in the OPAC; the gateway software then creates HTML on the fly and returns it to the client. HTML serves as a user-interface patch on top of the MARC-based OPAC. So we are still tied to the MARC record, with its limitations in accommodating complex hierarchical relationships and evaluative, analytical information. These limitations loom large for digitized texts, because users need exactly this type of information to make use of e-texts. What, then, is the solution? The following projects might give us some hints, for they have forged ahead to test the use of SGML as the structural base for the access and control of digital resources:
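Before looking at those projects, here is a minimal sketch of the gateway patch just described: a function that takes a MARC-like record out of the OPAC and generates HTML on the fly for the browser. Only the MARC tags are real; the record and the function itself are illustrative assumptions.

```python
# A minimal sketch of what a Web gateway does: take a MARC-ish record out of
# the OPAC and render HTML on the fly for the client's browser.

def marc_to_html(record):
    """Render a few MARC fields as a simple HTML fragment."""
    title = record.get("245", {}).get("a", "[no title]")
    author = record.get("100", {}).get("a", "")
    url = record.get("856", {}).get("u")
    lines = ["<html><body>", f"<h2>{title}</h2>"]
    if author:
        lines.append(f"<p>by {author}</p>")
    if url:
        lines.append(f'<p><a href="{url}">Connect to the electronic resource</a></p>')
    lines.append("</body></html>")
    return "\n".join(lines)

record = {"245": {"a": "Revenge of the Librarians"},
          "100": {"a": "Morville, Peter"},
          "856": {"u": "http://gnn.com/gnn/wr/96/05/17/webarch/index.html"}}
print(marc_to_html(record))
```

The three projects below go further: instead of patching an HTML interface onto MARC, they test SGML itself as the structural base.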

1. The Text Encoding Initiative

TEI is one of the most ambitious and far-reaching initiatives to use SGML for the creation and exchange of e-texts for scholarly research at an international level. Its Guidelines for Electronic Text Encoding and Interchange provides "a common encoding scheme for complex textual structures in order to reduce the diversity of existing encoding practices, simplify processing by machine, and encourage the sharing of electronic texts."4 The required TEI header contains metadata, data about data, on the e-text that follows. TEI deliberately rejected using MARC to accommodate the header information because most of this data would have had to be squeezed into the 500 fields and would thus lose its structure for access and use. The header includes a file description, an encoding description, a text profile description, and a revision history description. Of these four elements, the file description most resembles a cataloging record: it contains the title, author, creator and publisher of the electronic version, the file size, and the print source. This is what some e-text centers have already tapped to create MARC records while encoding TEI headers. The University of Virginia's Electronic Text Center, for example, could serve as a good model for libraries still deliberating whether to take on such a task. Its director, David Seaman, explained its guiding principles this way: "Central to our selection criteria is the desire for software and platform-independent texts - if it's not SGML, it's ephemeral - and central to our cataloging endeavors is an SGML bibliographic record such as the Text Encoding Initiative header."5 One important issue in creating MARC from TEI headers at Virginia has been the adoption of authority control for the headings to be contained in TEI headers; you see, there are no guidelines for that in TEI. In addition, they also wrote a Perl script to convert SGML to HTML on the fly so that it "allows us [them] to have Web access without reducing the markup of our [their] data to the rather impoverished level of HTML."6
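As an illustration of the Virginia practice of deriving catalog records from TEI headers, here is a sketch that maps a file description onto MARC fields. The header content is invented, and the tag choices are my own assumptions for the purpose of illustration, not Virginia's actual mapping.

```python
# A minimal sketch of deriving a MARC record from the file description of a
# TEI header. The MARC tag choices below are illustrative assumptions.

tei_file_description = {
    "title": "Life on the Mississippi: a machine-readable transcription",
    "author": "Twain, Mark, 1835-1910",
    "publisher_of_electronic_version": "A hypothetical electronic text center",
    "file_size": "ca. 850 kilobytes",
    "print_source": "Boston: James R. Osgood, 1883",
}

def marc_from_tei_header(fd):
    """Map file-description elements onto plausible MARC fields."""
    return {
        "100": {"a": fd["author"]},                             # main entry, after authority work
        "245": {"a": fd["title"]},
        "260": {"b": fd["publisher_of_electronic_version"]},
        "500": {"a": f"File size: {fd['file_size']}"},          # general note
        "534": {"a": f"Print version: {fd['print_source']}"},   # original version note
    }

for tag, subfields in sorted(marc_from_tei_header(tei_file_description).items()):
    print(tag, subfields)
```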

2. Berkeley Finding Aid Project

BFAP aims to create a prototype encoding scheme and to build a prototype database of archive, museum, and library finding aids. As an intermediate access tool between a collection-level MARC record and the primary source materials, finding aids serve as essential metadata for collections of related materials. Again, SGML was chosen over MARC for its potential to accommodate hierarchical and interrelated levels among collections, series, units, and items. Daniel Pitti and his team produced encoding standard design principles and laid the groundwork for developing a new SGML DTD named Encoded Archival Description, or EAD. The Library of Congress Network Development and MARC Standards Office will be the maintenance agency for EAD once it is endorsed as a standard by the archival community through the Society of American Archivists. This development signifies a profession-wide recognition that there is no better way to provide access and control of digital finding aids than to establish a standard and to collaboratively build a database of standard-conformant records.
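A sketch of why hierarchy matters here: the collection, series, and item levels below are invented, and the structure only illustrates the kind of nesting EAD is being designed to encode, not the EAD DTD itself.

```python
# A minimal sketch of why SGML suits finding aids: a collection is a nested
# hierarchy (collection > series > items), and the encoding must preserve
# those levels, which a flat collection-level MARC record cannot express.

finding_aid = {
    "collection": "Papers of a hypothetical author, 1900-1950",
    "series": [
        {"title": "Correspondence",
         "items": ["Letter to publisher, 1923", "Letter from editor, 1931"]},
        {"title": "Manuscripts",
         "items": ["Draft of first novel, 1928"]},
    ],
}

def outline(aid):
    """Print the hierarchy, level by level."""
    print(aid["collection"])
    for series in aid["series"]:
        print("  Series:", series["title"])
        for item in series["items"]:
            print("    Item:", item)

outline(finding_aid)
```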

3. Columbia University Digital Image Access Project

DIAP is a Research Libraries Group initiative on access to digital images, and Columbia University has set out under it to develop a new model that contains both bibliographic and analytical data on digital images. "The proposed data model could readily be "housed" in an SGML-encoded bibliographic (metadata) record that encapsulates both summary bibliographic information along with detailed hierarchical and version-related data, when such data is appropriate and considered useful to record. The record would also include links to the actual digital items, to other related bibliographic records or, in fact, to diverse, related digital objects (such as external electronic publications, databases, numeric files, etc.) The working designation SGML Catalog Record (SCR) is proposed for this new type of record."7 Again, the flat MARC record structure is bypassed in favor of the richer, more flexible SGML encoding scheme. Yet, in order to operate in a MARC-based world, Columbia proposed linking MARC records to SCRs so that the enriched SGML-encoded data would not be lost to users.
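That linking idea can be sketched as follows; the identifiers, URLs, and field choices are illustrative assumptions, not Columbia's actual design.

```python
# A minimal sketch of a conventional MARC record pointing to a richer SGML
# Catalog Record (SCR), which in turn carries version detail and links to
# the digital images themselves. Everything here is hypothetical.

scr_store = {
    "scr-0001": {
        "summary": "Architectural drawings, digitized",
        "versions": [{"format": "JPEG", "url": "http://www.example.edu/images/0001.jpg"},
                     {"format": "TIFF", "url": "http://www.example.edu/images/0001.tif"}],
    }
}

marc_record = {
    "245": {"a": "Architectural drawings [graphic]"},
    "856": {"u": "http://www.example.edu/scr/scr-0001"},   # link out to the SCR
}

def follow_link(record, store):
    """Resolve the MARC record's link into the richer SCR it points to."""
    scr_id = record["856"]["u"].rsplit("/", 1)[-1]
    return store.get(scr_id)

print(follow_link(marc_record, scr_store))
```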

All three of these projects are compelling models that prompt libraries to rethink the role of MARC-based OPACs in the access and control of digital resources. If we, as creators and organizers of these digitized resources, cannot integrate access to them into our OPACs, then we will have to create a new infrastructure to accommodate them. After all, these resources are under our control - we either select and purchase them or we digitize them ourselves. The recent news that LC's Network Development and MARC Standards Office, along with the USMARC standards body, MARBI, is taking on the task of developing a USMARC DTD is very encouraging. The prospect of building a shared database of SGML-encoded records and providing an online library system with hyperlinked access to bibliographic, analytical, and evaluative information on digital resources does not seem too far-fetched to me. I know. It is too soon to tell!

Now I would like to turn our attention to those Internet resources that are not under our control. "The most important point to remember is that," as Peter Graham, one of the most clear-headed among us, stated, "from the users' point of view, the catalog is only one of a repertoire of tools."8 We know that cataloging the entire content of the Internet is not humanly possible, and that getting the producers of Internet tools to adopt our traditional control mechanisms is probably next to impossible. There are, however, some Internet search tools that use classification schemes: CyberDewey uses Dewey, CyberStacks uses the Library of Congress Classification, and BUBL uses UDC. Yahoo!'s catalogers, on the other hand, decided to use their own subject system after rejecting Dewey, LC, and LCSH; now they are finding that the subject trees are getting too heavy and are in danger of total collapse. Simply put, what we are all searching for are standards for creating record surrogates for item description, access, and use - the kind of metadata standards that could be adopted and used globally across the digital universe, whether the information is bibliographic, scientific, technical, or geospatial. The TEI header is a good example of such a standard for e-texts, as the Federal Geographic Data Committee (FGDC) standard is for geospatial information. Yet in order to enable creators and publishers from all user communities to embed metadata in their digital resources, the data elements have to be transparent, generic, and easy to supply. Thus was born the Dublin Core Element Set.

The OCLC/NCSA Metadata Workshop, held in Dublin, Ohio, in March 1995, set out to reach some kind of consensus on a limited set of data elements for the discovery and retrieval of Internet resources. Participants came from all the stakeholder groups: researchers, software developers, computer and information scientists, publishers, librarians, archivists, and members of the Internet Engineering Task Force (IETF) Uniform Resource Identifiers (URI) Working Group. Can you imagine putting all these people in one room for two days and expecting them to come to an agreement on a common set? Well, they did it. This common set of elements would have four uses: "...It would encourage authors and publishers to provide metadata along with their data. It would allow developers...to include templates for this information directly in their software. Metadata...could serve as a basis for more detailed cataloging or description when warranted by specific communities...the standard would ensure that a common core set of elements could be understood across communities..."9 The resulting set includes 13 data elements. Yet simplicity brings forth ambiguity: the core is of limited usefulness unless it can be further defined and formally linked to more descriptive records containing details specific to a user community. For the library community, these data elements can for the most part be mapped to USMARC, although guidelines would need to be established to ensure portability. This means that we could extend and enhance such metadata in digital objects and integrate them into OPACs.
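As an illustration, the sketch below maps a handful of the workshop's elements (Title, Author, Subject, Publisher, Date, Identifier) onto USMARC fields. The tag choices are my own illustrative assumptions and should not be read as the Caplan and Guenther mapping; the sample values are drawn loosely from the Morville article cited earlier.

```python
# A minimal, illustrative mapping from a few Dublin Core elements to USMARC
# fields. Tag choices are assumptions for the sketch, not an official crosswalk.

ILLUSTRATIVE_DC_TO_MARC = {
    "Title": ("245", "a"),
    "Author": ("720", "a"),       # uncontrolled name, pending authority work
    "Subject": ("653", "a"),      # uncontrolled index term
    "Publisher": ("260", "b"),
    "Date": ("260", "c"),
    "Identifier": ("856", "u"),
}

dublin_core = {                   # sample creator-supplied metadata
    "Title": "Revenge of the Librarians",
    "Author": "Morville, Peter",
    "Subject": "Internet searching",
    "Publisher": "Web Review",
    "Date": "1996-05-17",
    "Identifier": "http://gnn.com/gnn/wr/96/05/17/webarch/index.html",
}

def to_marc(dc):
    """Convert supplied Dublin Core elements into minimal MARC fields."""
    fields = []
    for element, value in dc.items():
        tag, code = ILLUSTRATIVE_DC_TO_MARC[element]
        fields.append({"tag": tag, "subfields": {code: value}})
    return fields

for field in to_marc(dublin_core):
    print(field["tag"], field["subfields"])
```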

This calls for testbeds from all stakeholders to bridge the Dublin Core element set to their own agreed-upon element sets, working toward an interoperable mapping scheme. I understand that OCLC's SPECTRUM system is testing whether a mapping is possible from user input of Dublin Core data elements in HTML to TEI headers and MARC records. Let's hope that more of this is to come.

When we talk about metadata, we also need to look at the alphabet soup: the Uniform Resource Locator (URL), the Uniform Resource Name (URN), and the Uniform Resource Characteristic (URC). Of the three, the URL is the most mature. The URN is intended to resolve the problem of URL instability: it is to be unique, persistent, and location-independent so that it can serve as the link between the URL and the object. Several naming conventions are under consideration, OCLC's Persistent Uniform Resource Locator (PURL) and CNRI's Handle server being among them; the result could be a combination of both. The URC, in turn, is a type of metadata intended to ensure machine retrieval and authentication of resources, with the capability of restricting access. There are currently 10 data elements that, if consistently applied, should guarantee discovery and retrieval of digital objects. The URC is to serve as a link to both the URL and the URN, so that a user could access the URC to retrieve the URN and then the URL to get to the object itself.
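That chain can be sketched as follows. The identifiers and lookup tables are invented, and real resolution services such as PURL or the Handle system differ in their details; this only shows the idea of a location-independent name being resolved at the moment of use.

```python
# A minimal sketch of the resolution chain described above: the user starts
# from metadata (URC), obtains the persistent name (URN), and resolves that
# to a current location (URL).

urc_store = {   # URC: metadata about the object, including its URN
    "urc:demo:42": {"title": "A sample electronic text", "urn": "urn:demo:text-42"},
}
urn_resolver = {   # URN -> current URL; updated when the object moves
    "urn:demo:text-42": "http://www.example.edu/etexts/text-42.sgml",
}

def fetch(urc_id):
    """Follow URC -> URN -> URL to reach the object itself."""
    urc = urc_store[urc_id]
    urn = urc["urn"]
    url = urn_resolver[urn]      # location-independent name resolved at use time
    return urc["title"], url

print(fetch("urc:demo:42"))
```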

I recently learned from Stu Weibel of OCLC at the ALA conference in New York that the W3 Consortium has actually reached consensus on a syntax for embedding metadata in digital objects and on a draft for link tags (metatags) to serve as a bridge among different schemes. Can the day when we see a robust metadata framework for Internet resources be far behind? Once the metadata schemes are properly mapped and linked, structural search tools could then be constructed to facilitate browsing. But are we there yet? Not quite. We still have the character set issue to resolve, and we hope the implementation of 16-bit Unicode will alleviate most of the problems we now encounter in a multilingual environment. And we still have the thesaurus mapping issue that just won't go away. If only we had a common vocabulary to serve as the bridge, we would be there!
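I do not know the exact syntax that was agreed upon, but the general idea of creator-embedded metadata with a link out to a richer record might look something like the sketch below; the element names, attribute conventions, and URL are illustrative assumptions only.

```python
# A minimal sketch of embedding metadata in an HTML page with META tags and
# a LINK tag pointing to a fuller record elsewhere. This illustrates the
# general idea only, not the actual syntax the W3 Consortium settled on.

def embed_metadata(title, author, fuller_record_url):
    """Emit an HTML head section carrying creator-supplied metadata."""
    return "\n".join([
        "<head>",
        f"<title>{title}</title>",
        f'<meta name="DC.title" content="{title}">',       # Dublin Core-style element names
        f'<meta name="DC.author" content="{author}">',
        f'<link rel="meta" href="{fuller_record_url}">',    # bridge to a richer scheme
        "</head>",
    ])

print(embed_metadata("Revenge of the Librarians", "Peter Morville",
                     "http://www.example.edu/records/morville.tei"))
```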

So, you might ask, what is our role as organizers of information at this critical juncture? For one thing, we can do a better job of selecting Internet resources and integrating them into our OPACs. Selection policies allow us to sift through the noise and bring objects of value into our digital content for our clients. We can then use our traditional access and control mechanisms to provide the same kind of access we provide for print content. The Library of Congress recently announced a pilot project, running from June through December of this year, to select and catalog Internet resources in the area of business and economics. BeOnline will provide a testbed to address issues such as the tools needed to integrate networked information, the criteria for determining which resources warrant cataloging or a homepage listing, the adequacy of traditional tools for description, URL resolution, and ways to reduce the cost of cataloging such resources. I am glad that LC is taking a leadership role in this venture.

Next, we need to work with document creators or, better still, become creators ourselves, to ensure that digital resources are SGML-encoded with embedded metadata that can be incorporated into our MARC-based OPACs. We need to get involved in the development and testing of SGML DTDs for the value-added content under our stewardship. We need to be active participants in the deliberations over metadata schemes such as the URC. We need to think globally and act locally to bring about the integration of our traditional access and control mechanisms into the new meta-access system. And, finally, we need to work with library system designers on the requirements for an SGML-based online system.

REFERENCES:

1. http://www.csmonitor.com/plweb-cgi/idoc.pl?137709+unix+_free_user_+www.csmonitor.com..80+paper+paper+archives+archives++librarians

2. http://gnn.com/gnn/wr/96/05/17/webarch/index.html

3. Edward Gaynor, "From MARC to Markup: SGML and Online Library Systems," From Catalog to Gateway, Briefings from the CFFC, no. 7, in ALCTS Newsletter, vol. 7, no. 2, 1996.

4. C. M. Sperberg-McQueen and Lou Burnard, Guidelines for Electronic Text Encoding and Interchange. Chicago: Text Encoding Initiative, 1994. (http://etext.virginia.edu/TEI.html)

5. David M. Seaman, "Selection, Access, and Control in a Library of Electronic Texts," Cataloging and Classification Quarterly, vol. 22, no. 3/4, 1996.

6. Ibid., 5.

7. Steven Paul Davis, "Digital Image Collections: Cataloging Data Model & Network Access." http://www.cc.columbia.edu/cu/libraries/inside/projects/diap/paper.html

8. Peter S. Graham, "The Mid-decade Catalog," From Catalog to Gateway, Briefings from the CFFC, no. 1, in ALCTS Newsletter, vol. 5, 1994.

9. Priscilla Caplan and Rebecca Guenther, "Metadata for Internet Resources: The Dublin Core Metadata Elements Set and Its Mapping to USMARC," Cataloging and Classification Quarterly, vol. 22, no. 3/4, 1996.
