ETD Guide/Technical Issues/Metadata models for ETDs

One of the objectives of an ETD program is to yield easy access to TDs. Since we are dealing with digital libraries, we are implicitly dealing with libraries. One of the actions performed on a library catalog is that of search and retrieve. This is the first step towards accessing the contents of a library item; the second step is the use (read, listen, view, etc.) of the item.

For search and retrieval to be efficient, the user needs good search functions and a catalog in which the items are properly identified.

This section is about the identification of ETDs, which is a very important step towards their dissemination. Identification is accomplished through metadata elements; the set of these elements is called the metadata model of the digital library of TDs.

Before we address metadata models for ETDs, some ideas need to be brought into the discussion, because they affect the choice of a model later on. The model must be rich and versatile enough to hold information of different kinds and to be searched by users from all over the world.

The richer and more versatile the metadata model is, the more time and effort it takes to capture (collect and record) the information into the digital library. The decision on which model to use has to take this into consideration. In some situations it may be necessary to adopt the simplest possible model in order to make the metadata capture viable. Later in this chapter the Dublin Core Metadata Element Set will be introduced. It appears to be the consensus minimum for identifying ETDs.

The ideas for us to think about are:

Many languages in one world
ETDs to be read all over the world
Metadata
Contents and instances
Contents, instances and metadata
Contents, instances and languages
Metadata models and languages
Metadata schemes
Specialization of the metadata models for TDs
Conclusion - metadata models for ETDs

Many languages in one world

Our world is a linguistically very diverse place. Those who work with information and are involved in international projects generally know English: it is the language they use to communicate, to access the Internet, to read technical literature, etc.

At the same time, many other languages exist, and some of them have large numbers of native speakers. The 100 most spoken languages of the world, counted by first-language speakers, are listed at http://www.sil.org/ethnologue/top100.html. In descending order, the first 10 are Chinese (Mandarin), Spanish, English, Bengali, Hindi, Portuguese, Russian, Japanese, German (Standard) and Chinese (Wu).

Even if only the other nine languages are considered, it is not hard to imagine the number of texts that are written and published in them every year. The same happens with TDs: the number of TDs published in languages other than English must be very large.

ETDs to be read all over the world

One of the purposes and benefits of an ETD program is to yield easy access to the results presented in TDs, no matter where the reader is and where the dissertation was written.

To make sure this benefit is achieved, we assume that ETD digital libraries are connected to the Internet so that their contents can be shared worldwide.

Metadata

Metadata are data about data or information about information.

The metadata elements are the attributes used to describe a digital library item just like the ones used to catalog items in a traditional library.

Many of these attributes are language dependent, for example titles, abstracts, subjects and keywords. Others are not, for example authors' names, the digital format and the number of bytes of the file.

Since some metadata elements are language dependent and TDs are written in many languages, we can expect that the metadata will most probably be recorded in the language of the work. This can pose a problem for search and retrieve activities, since most of us are not fluent in as many languages as we would like to be.

Contents and instances

The items of a digital library may be identified at two different levels, in the same way as the items of a traditional library. The first level is the content, which is equivalent to a title in a traditional library, and the second is the instance, which is equivalent to a volume.

A content is the logical definition of an item of the digital library and it is identified by a set of attributes. An instance is the physical realization of a content or title. It is a digital object and is identified by a set of attributes too.

The use of contents and instances allows a content to have multiple instances, either in different formats or due to physical partitions. This yields a one-to-many relationship between contents and instances.

The use of contents and instances also allows the access control to be performed on the partitions instead of on the content. This makes the digital library more flexible in terms of dealing with intellectual property rights.

Therefore, we can conclude that there are attributes that are particular to contents and others that refer to instances. The metadata model must contain both.
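
The content / instance distinction can be illustrated with a minimal sketch. It is not taken from any particular digital library system: the class names, the field names and the access level values "public" and "restricted" are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    """A physical realization (digital object) of a content."""
    file_name: str
    electronic_format: str  # e.g. "application/pdf"
    access_level: str       # hypothetical values: "public" or "restricted"


@dataclass
class Content:
    """The logical definition of a library item (the 'title')."""
    title: str
    abstract: str
    # One content may have many instances (one-to-many relationship).
    instances: List[Instance] = field(default_factory=list)

    def public_instances(self) -> List[Instance]:
        # Access control is decided per instance, not per content.
        return [i for i in self.instances if i.access_level == "public"]


# Made-up sample data: one thesis split into partitions and formats.
thesis = Content(title="A sample thesis", abstract="A short abstract.")
thesis.instances.append(Instance("thesis_ch1.pdf", "application/pdf", "public"))
thesis.instances.append(Instance("thesis_ch2.pdf", "application/pdf", "restricted"))
thesis.instances.append(Instance("thesis_full.ps", "application/postscript", "public"))

print([i.file_name for i in thesis.public_instances()])
```

In this sketch the title and abstract belong to the content, while the format and the access level belong to each instance, so one chapter can be restricted without hiding the rest of the work.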

Contents, instances and metadata

Some metadata elements are common to all contents, for example title, abstract and type, while others are common to all instances, for example electronic format and access level.

On the other hand, some metadata elements are specific to some contents, for example those for translation control (original content, translator, etc.), and others are specific to some instances, for example special equipment, expiration date and remote location.

It follows that the metadata model must be versatile enough to contain the attributes that are common to all contents and to all instances as well as the specific ones, in order to accommodate specialization of the digital library items.

Contents, instances and languages

Contents may be language dependent. The language of the content is the one in which it is written, spoken or sung.

Other languages may be associated with a content: the ones in which it is catalogued. A content written, spoken or sung in one language can be described in one or more other languages. In that case there is one catalog entry in each of the languages used.

The use of multilingual cataloguing yields points of access in different languages if the search is performed in all of them. This topic will be addressed in the section Database and IR.
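
As a rough illustration of how multilingual cataloguing yields access points in different languages, the sketch below keeps one catalog entry per language for each content and searches all of them. The entries, the field names and the `search` function are hypothetical, and the substring match stands in for whatever search function a real system provides.

```python
# Made-up catalog: content 1 is catalogued in Portuguese and in English,
# content 2 only in Spanish.
catalog = [
    {"content_id": 1, "lang": "pt", "title": "Bibliotecas digitais de teses"},
    {"content_id": 1, "lang": "en", "title": "Digital libraries of theses"},
    {"content_id": 2, "lang": "es", "title": "Redes neuronales"},
]


def search(term, entries):
    """Return the ids of contents whose title matches in any language."""
    term = term.lower()
    # A set removes duplicates when a content matches in several languages.
    return sorted({e["content_id"] for e in entries if term in e["title"].lower()})


print(search("digital libraries", catalog))  # found through the English entry
print(search("bibliotecas", catalog))        # found through the Portuguese entry
```

The same content is reached through either language, which is the point of keeping one entry per cataloguing language.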

Metadata models and languages

The digital library can be defined to hold more than one language. A good choice would be, at least, the language(s) of the nation where the TDs are written, plus English.

If this is the case, the metadata model can hold every language-dependent attribute once for each language used in the digital library. The language code must then be part of the primary key in the database.

Attributes that are language independent would have only one representation in the database.
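
One possible way to lay this out in a relational database is sketched below with Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made for illustration; the text above only requires that language-dependent attributes be stored once per language, with the language code as part of the primary key, and that language-independent attributes be stored once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Language-independent attributes: one row per content.
CREATE TABLE content (
    content_id   INTEGER PRIMARY KEY,
    author       TEXT NOT NULL,
    content_lang TEXT NOT NULL
);
-- Language-dependent attributes: one row per content and language.
CREATE TABLE content_description (
    content_id    INTEGER NOT NULL REFERENCES content(content_id),
    language_code TEXT NOT NULL,
    title         TEXT NOT NULL,
    abstract      TEXT,
    PRIMARY KEY (content_id, language_code)
);
""")

# Made-up sample data: a thesis written in Portuguese, catalogued in two languages.
conn.execute("INSERT INTO content VALUES (1, 'Silva, Maria', 'pt')")
conn.executemany(
    "INSERT INTO content_description VALUES (?, ?, ?, ?)",
    [
        (1, "pt", "Modelos de metadados para teses", "Resumo do trabalho."),
        (1, "en", "Metadata models for theses", "Abstract of the work."),
    ],
)


def describe(content_id, language_code):
    """Return (author, title) for one content in one cataloguing language."""
    return conn.execute(
        "SELECT c.author, d.title FROM content c "
        "JOIN content_description d ON d.content_id = c.content_id "
        "WHERE c.content_id = ? AND d.language_code = ?",
        (content_id, language_code),
    ).fetchone()


print(describe(1, "en"))
print(describe(1, "pt"))
```

Because the primary key of the description table is the pair (content_id, language_code), the database accepts one description per language for each content and rejects a second description in the same language.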

Metadata schemes

There are quite a few metadata schemes. Some are strictly related to library items while others have a broader scope, for example the ones devoted to digital objects used in Web Based Education. Some schemes are well known and should be mentioned:

DCMES - Dublin Core Metadata Element Set
http://purl.org/dc/documents/rec-dces-19990702.htm
Under the responsibility of the DCMI - Dublin Core Metadata Initiative
http://www.purl.oclc.org/metadata/dublin_core/
http://purl.org/dc/
This metadata element set will be presented in the section Cataloging: MARC, DC, RDF.
IMS Project - Instructional Management System Project
http://www.imsproject.org/
The metadata element set defined by the IMS Project has the objective of identifying digital objects used in Web based Education. It contains all the elements of the DCMES and many more.
LOM - Learning Objects Metadata of the Learning Technology Standards Committee of the Institute of Electrical and Electronics Engineers (LTSC/IEEE)
http://ltsc.ieee.org/doc/wg12/LOM_WD4.htm/
The metadata element set defined by the LTSC/IEEE (http://ltsc.ieee.org/) has the objective of identifying digital objects used in Web based Education. It contains all the elements of the DCMES and many more.
LoC - Core Metadata Elements of the Library of Congress
http://lcweb.loc.gov/standards/metadata.html

The second and third schemes are used when Web Based Education (WBE) is under consideration. Since they contain the DCMES, they do not conflict with general digital library identification.

Specialization of the metadata models for TDs

Besides the usual data contained in general purpose metadata schemes, there are some types of information related to TDs that may be of interest to the university. For this reason, it may be useful to consider adding extra metadata elements to the traditional metadata schemes. The additional elements can be separated into three groups:

Administrative information - department, date of presentation, date of acceptance, financial support, etc.
Academic information - level, mentor, examining committee, etc.
Traditional library information - university, library system, control number, call number, etc.

These may be useful to yield information concerning the graduate programs of the university.
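
The three groups above can be pictured as extensions kept next to a general purpose core. The sketch below uses the fifteen element names of the Dublin Core Metadata Element Set for the core; the record layout, the group keys, the field names and all sample values are assumptions made for illustration, not a prescribed format.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DCMES = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Made-up record: a general purpose core plus three TD-specific groups.
record = {
    "dc": {
        "title": "A sample thesis",
        "creator": "Silva, Maria",
        "language": "en",
        "type": "Text",
    },
    "administrative": {"department": "Electrical Engineering",
                       "date_of_acceptance": "2001-03-15"},
    "academic": {"level": "doctoral", "mentor": "Souza, Joao"},
    "library": {"university": "Sample University", "call_number": "T 123"},
}


def unknown_dc_elements(rec):
    """List keys in the core group that are not Dublin Core elements."""
    return sorted(set(rec["dc"]) - DCMES)


print(unknown_dc_elements(record))  # an empty list means the core is plain DCMES
```

Keeping the extra elements in separate groups leaves the core untouched, so the record can still be exchanged with libraries that understand only the general purpose scheme.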

Conclusion - metadata models for ETDs

The definition of the metadata model for an ETD digital library must combine:

The need to identify ETDs properly so that the access goals are achieved (national access? international access?)
The administrative needs of the university

At the same time, the restrictions imposed by budget or operation time frames must be taken into consideration. There is a balance between what is desired and what is possible. Some comments concerning this balance follow:

For international access, the use of English besides the original language(s) is mandatory. This means that titles and abstracts must be translated, and that subject headings, keywords, etc. become multilingual catalogs that have to be maintained.
For the ETD digital library to be a part of the international community, the minimum requirements in terms of ETD identification must be met. This means that at least the DCMES must be used.
For the university to have good control of the intellectual property, the content / instance concept allows access specifications to be established on the digital objects. Thus, some objects may be made public while others may have different types of restrictions due to format or to intellectual content.
In the definition of the workflow to operate the ETD program, attention must be given to the capture of the metadata elements. If non-librarians are involved in the process, there must be a good training program and a careful review process so that the attributes are catalogued correctly.

The choice of the metadata model is very important and the team in charge of the implementation of the ETD program must study the possibilities before making the decision. Minimum standards must be met.
