Open Metadata Handbook/Metadata Standards


STATE OF THE ART

Serialization schemas

  • Turtle, for triples; also RDF/XML
  • XML, for data that can be marked up in a flat record
  • MARC, another serialization format that can carry a variety of data types (as ISO 2709)



Metadata data models

kc: it would be best to separate these into models for metadata elements vs. models for value vocabularies. SKOS would be in the latter.

A data model explicitly determines the structure of data or structured data (as opposed to the content thereof). A metadata data model exclusively describes the syntax of the metadata schema, independently of the vocabulary that is being used. It merely describes the entities of the metadata "realm" and is independent of any serialization.


RDF/OWL

The W3C standard Resource Description Framework (RDF) is the default foundation for machine-processable semantics. The RDF data model is not a true metadata schema, but merely provides an abstract, conceptual framework for defining and using metadata, or other metadata models. It can be used to describe or create new models (objects / properties) for the conceptual description or modelling of information that can be implemented in web resources, using a variety of syntax formats.

The RDF data model imposes structural constraints on the expression of application data models for consistent encoding, exchange and processing of metadata. While it is not the only one, RDF is definitely the main metadata model in use today. It is the most widely deployed and also the one with the largest number of vocabularies. Endorsed by the W3C and many universities, RDF offers a huge set of ontologies and vocabularies already implemented and maintained. Whatever ontologies and vocabularies are properly implemented and maintained in RDF can easily be appropriated by open bibliographic efforts, provided these ontologies and vocabularies are available for use with an open license.

As such, RDF doesn't have a specific domain: it is a generic framework which must be extended with vocabularies and ontologies in order to describe something. The description of resources is based on objects and properties which are themselves described in RDF. With RDF, it is thus possible to describe/generate new vocabularies used to describe resources or things - which can in turn be vocabularies themselves (e.g. the various OWL vocabularies). RDF is formulated as a stratification of concepts and ontologies - to eventually create new concepts. Just as in object-oriented programming, where one can create new classes by extending other classes, RDF allows one to create new concepts by extending other concepts. The difference is that RDF is property-oriented as opposed to object-oriented.

In RDF, everything is based on the concept of "semantic triples": Subject, Property, Object. (A minimal example follows the list below.)

  • the Subject is the resource, identified by a URI / URL
  • the Property is another resource identified by a URI; it must be defined elsewhere (e.g. it can come from a dictionary, namespace, schema, or ontology)
  • the Object can be a URI or a "value" (a string, a number, etc.), or possibly a blank node (http://en.wikipedia.org/wiki/Blank_node)
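
Here is what a single triple looks like in Turtle, as a minimal sketch (the book URI and title are invented for the example):

@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://example.org/book/1> dc:title "Open Metadata Handbook" .

The Subject is <http://example.org/book/1>, the Property is dc:title (defined elsewhere, in the Dublin Core namespace), and the Object is the literal string.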

RDF also defines basic concepts for other ontologies to build upon. These basic elements are:

  1. Classes: Resource, Class, Property, List, Literal, Numbers, etc.
  2. Properties: 'to be' (rdf:type), subClassOf, subPropertyOf, label, etc.

Everything else can be derived from that: every class of every RDF/OWL vocabulary will always be an rdfs:Class. If one needs a property that does not exist yet, one can write an RDF document that creates/describes it. Once that property has been defined, it exists and can be used in any other RDF document.
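
As a sketch of this mechanism (the ex: namespace and property name are invented for the example), a new property can be declared in a few lines of Turtle and then used like any other:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix ex:   <http://example.org/terms/> .

# declare the new property and anchor it to an existing one
ex:subtitle a rdf:Property ;
    rdfs:label "subtitle" ;
    rdfs:subPropertyOf dc:title .

# the new property can now be used in any other RDF document
<http://example.org/book/1> ex:subtitle "A Worked Example" .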

For instance, the FOAF ontology provides a definition of foaf:Person as an RDF class, described (in RDF/XML) as follows:

<rdf:Description rdf:about="http://xmlns.com/foaf/0.1/Person">
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>            <!-- the entity is of type owl:Class -->
  <rdfs:label>Person</rdfs:label>                                           <!-- the name of the entity is "Person" -->
  <rdfs:comment>A person.</rdfs:comment>                                    <!-- a human-readable description -->
  <rdfs:subClassOf rdf:resource="http://xmlns.com/foaf/0.1/Agent"/>         <!-- the entity is a subclass of the class Agent -->
  <owl:disjointWith rdf:resource="http://xmlns.com/foaf/0.1/Organization"/> <!-- the entity is disjoint with the class Organization -->
</rdf:Description>

See e.g. http://www.w3.org/People/Berners-Lee/card.rdf, which describes Tim Berners-Lee using various OWL vocabularies.


Pros:

  • Extensibility and adaptability:
    • RDF can be expressed in three different ways (Turtle, N3, XML) and can potentially be used to describe anything.
    • RDF allows different communities to define their own semantics: anyone can create new ontologies based upon pre-existing ontologies to describe new resources.
    • RDF permits the integration of an indefinite number of ontologies (as dictionaries of terms/properties/resources) within the same RDF file.
  • Popularity: RDF is endorsed by the W3C and used in many academic projects. It is easy to find well-maintained and well-documented RDF ontologies online.
  • SPARQL: an extremely powerful query language that can be used to query any database in which RDF metadata has been stored.

Cons:

  • External dependency: in order to describe anything, RDF must necessarily rely on one or more external sources.
  • Resource intensive: RDF might require big triple stores (with hundreds of millions of triples) and SPARQL systems which might turn out to be too heavy; many institutions currently do not have the infrastructure to handle that well. This can be an excessive burden, with a lack of scalability, for what should be simple bibliographic tasks like managing a few million bibliographic records.
  • Open Bibliographic Data: RDF may be fine as an abstract model, but its practical implementation for open bibliographic purposes remains to be provided and supported. Only very big players can manage the infrastructure necessary to deal with RDF (and can they be trusted to keep the data open?).
  • SPARQL: if a query is not entirely predictable, it could be NP-hard (i.e. it might not return within any determined amount of time).

BibTeX ???

BibTeX is a reference-management tool for formatting lists of references; it is typically used together with the LaTeX document preparation system. It is a system that can be extended with 'dictionaries' (called styles) to cover other fields of application - but it is not itself a metadata format (even though it can be used as such). So far, it has in fact been used both as a format for aggregating millions of references containing article metadata, and as a format for the provision of faceted displays of bibliographic data. Neither purpose was intended by its creator, but the BibTeX creator showed good judgement in the flexibility and extensibility of the BibTeX data model, so it has been usable (though stretched) for these other purposes.
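
For orientation, a typical BibTeX record looks like the following (an invented example; the set of fields varies with the entry type):

@book{bryers2011avatar,
  author    = {Bryers, Paul},
  title     = {Avatar},
  publisher = {Example Press},
  year      = {2011}
}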

  • PDF: Is it even worth mentioning? It is not directly relevant to the guide, and it seems to me like it merely adds more complexity.
  • JP: BibTeX should certainly not be recommended for new metadata creation. Rather, its reincarnation in BibJSON should be preferred, along with some more rigorous formulation of BibJSON, e.g. using JSON Schema http://tools.ietf.org/html/draft-zyp-json-schema-03

Metadata schemas

Metadata syntax refers to the rules created to structure the fields or elements of metadata. A single metadata scheme may be expressed in a number of different markup or programming languages, each of which requires a different syntax. For example, Dublin Core (a metadata schema) may be expressed in plain text, HTML, XML and RDF (kc: any serialization). The reason is that DC is not a single thing. (kc: not true. It's because the properties are defined in RDF, which is serialization-neutral. This is true of any RDF-defined metadata.) DC is a consortium that releases different specifications for every typology of metadata, so that DC can be used anywhere and in any way.
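
Because the properties are defined abstractly, the same Dublin Core statement can be written in any serialization. A sketch of one title statement (the document URI is invented), first in Turtle:

@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://example.org/doc> dc:title "A title" .

and then in RDF/XML:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc">
    <dc:title>A title</dc:title>
  </rdf:Description>
</rdf:RDF>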


Based on metadata data models

e.g. the various OWL vocabularies based on RDF

Metadata schemas based on metadata data models can be regarded as self-descriptive metadata: the metadata contains sufficient information for the component, its properties and its relationship to other entities to be completely self-describing.

OWL (vocabularies) - OWL ontologies are based on RDF. They provide the semantic linkage needed by intelligent agents to extract valuable information from the raw data defined by RDF triples. Anything expressed in OWL is necessarily RDF, but not vice versa (just as RDF/XML is XML, but not all XML is RDF/XML). A variety of ontologies have already been developed, each with a specific purpose in mind. If none of the existing ontologies is adequate for a particular application, a new ontology can be created.

kc: OWL and FOAF are in entirely different categories. OWL is a language for defining metadata schemas, FOAF is an implementation. They shouldn't be in the same section.

Friend of a Friend (FOAF) is an RDF vocabulary, described using W3C RDF Schema and the Web Ontology Language. Conceived for the description of groups and persons, it provides basic properties and resources to express concepts such as: friend of, son of, lives in, works in, knows someone, is mine, etc. For more information, see: http://xmlns.com/foaf/0.1/index.rdf http://www.foaf-project.org/
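
A minimal FOAF description in Turtle (the people and URIs are invented for the example):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/people/alice#me> a foaf:Person ;
    foaf:name "Alice Example" ;
    foaf:knows <http://example.org/people/bob#me> .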

Dublin Core

Dublin Core can describe physical resources such as books, digital materials such as video, sound, image, or text files, and composite media like web pages. Metadata records based on Dublin Core are intended to be used for cross-domain information resource description and have become standard in the fields of library science and computer science. The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 metadata elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights. A sample record appears after the list below. Implementations of Dublin Core typically make use of XML and are Resource Description Framework based. For more information, see: http://dublincore.org/documents/dcmi-terms/ http://dublincore.org/2010/10/11/dcterms.rdf

The components of a Dublin Core Application Profile relate to "domain standards" (models and specifications in broader use by communities) and to RDF:

  • Description Set Profiles are based on the DCMI Abstract Model (DCAM) inasmuch as they specify how the entities of the DCAM are used in a specific set of metadata. In this sense, the DCAM constitutes a broadly recognized model of the structural components of metadata records. The DCAM, in turn, is grounded in RDF.
  • Description Set Profiles typically use properties and classes defined in standard metadata vocabularies such as the DCMI Metadata Terms. Metadata vocabularies, in turn, are expressed on the basis of the RDF Vocabulary Description Language (also known as RDF Schema, or RDFS).
  • The Domain Model used in an application is often based on a domain model in wider use; for example, the generic model Functional Requirements for Bibliographic Records (FRBR) is an important point of reference for resource description in the library world.
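
A sketch of a simple record using a few of the fifteen DCMES elements, serialized in XML (the wrapping <metadata> element and the resource described are invented for the example):

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Open Metadata Handbook</dc:title>
  <dc:creator>Example, Author</dc:creator>
  <dc:date>2011</dc:date>
  <dc:language>en</dc:language>
  <dc:rights>CC0</dc:rights>
</metadata>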


Bibliographic Ontology (BIBO)

An extension of Dublin Core for the description of bibliographic data. The Bibliographic Ontology Specification provides main concepts and properties for describing citations and bibliographic references (i.e. quotes, books, articles, etc.).

POWDER

The Protocol for Web Description Resources (POWDER) is the W3C recommended method for describing web resources. It specifies a protocol for publishing metadata about Web resources using RDF, OWL, and HTTP. For more information, see: http://www.w3.org/2007/05/powder-s


SPAR

The Semantic Publishing and Referencing Ontologies (SPAR) http://sempublishing.svn.sourceforge.net/viewvc/sempublishing/SPAR/index.html form a suite of orthogonal and complementary ontology modules for creating comprehensive machine-readable RDF metadata for all aspects of semantic publishing and referencing. The ontologies can be used either individually or in conjunction, as need dictates. Each is encoded in the Web ontology language OWL 2.0 DL. Together, they provide the ability to describe far more than simply bibliographic entities such as books and journal articles, by enabling RDF metadata to be created to relate these entities to reference citations, to bibliographic records, to the component parts of documents, and to various aspects of the scholarly publication process.

All eight SPAR ontologies - FaBiO, CiTO, BiRO, C4O, DoCO, PRO, PSO and PWO - are available for inspection, comment and use. They are useful for describing bibliographic objects, bibliographic records and references, citations, citation counts, citation contexts and their relationships to relevant sections of cited papers, the organization of bibliographic records and references into bibliographies, ordered reference lists and library catalogues, document components, publishing roles, publishing status and publishing workflows.

Where appropriate, the SPAR ontologies - specifically FaBiO, the FRBR-aligned Bibliographic Ontology, and BiRO, the Bibliographic Reference Ontology - employ the FRBR (Functional Requirements for Bibliographic Records) cataloguing model, a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) as a generalized view of the bibliographic universe, intended to be independent of any cataloguing code or implementation [Sau98, Til03]. FRBR distinguishes Works, Expressions, Manifestations and Items.

Geo (WGS84)

Geo is a basic RDF vocabulary that provides the Semantic Web community with a namespace for representing lat(itude), long(itude) and other information about spatially-located things, using WGS84 as a reference datum. An example follows. For more information, see: http://www.w3.org/2003/01/geo/
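
A sketch of the vocabulary in use, in Turtle (the place URI is invented; the coordinates are an arbitrary point):

@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://example.org/place/london> a geo:Point ;
    geo:lat "51.5074" ;
    geo:long "-0.1278" .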

GeoNames

The GeoNames Ontology makes it possible to add geospatial semantic information to the World Wide Web. All of the over 6.2 million GeoNames toponyms now have a unique URL with a corresponding RDF web service. For more information, see: http://www.geonames.org/ontology/documentation.html

kc: this section should separate metadata properties from value vocabularies (controlled lists of terms that are used as the values of properties, like GeoNames, ISO language codes, various subject headings and thesauri)

FRBR

Functional Requirements for Bibliographic Records (FRBR) standardises a set of terms and relationships that are essential to any cataloguer. FRBR is both a general model and a set of properties. For more information, see: http://metadataregistry.org/schema/show/id/5.html (this is the "official" version - http://purl.org/vocab/frbr/frbr-core-20050729.rdf and http://purl.org/vocab/frbr/core are out of date and not approved by the FRBR development group).

CIDOC/CRM

The CIDOC Conceptual Reference Model (CRM) is a formal ontology that provides definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation. It provides an integrated framework for different kinds of resources: archives, images, places, objects. For more information see: http://www.cidoc-crm.org/rdfs/cidoc-crm

      • CIDOC is generally used by museums for describing artefacts, not bibliographic entities? Are there real-world examples of it being used for bib data?

DOAP

DOAP (Description of a Project) is a vocabulary for the description of code, licenses, repositories, authors, patches, etc. Never seen DOAP used in bibliographic metadata? Examples?


Digital Resource Terms

For describing and linking to digital resources. These are extensions to the Dublin Core Element Set and Dublin Core Qualifiers used in the Digital Resource Description (DRD) Application Profile (http://www.natlib.govt.nz/dr/drd.html). For more information, see: http://www.natlib.govt.nz/dr/drterms.rdf http://www.natlib.govt.nz/dr/terms

Digital Resource Role

A controlled term vocabulary for describing the role a digital asset plays in a digital resource. It is intended for use in the Digital Resource Description (DRD) Application Profile (http://www.natlib.govt.nz/dr/drd.html). It was originally developed by the National Library of New Zealand to assist in tracking multiple derivative files created from source digital files. For more information, see: http://www.natlib.govt.nz/dr/drrole.rdf http://www.natlib.govt.nz/dr/role

BibTeX in OWL

A recasting of the BibTeX bibliographic markup language in OWL for use in RDF and semantic web applications. For more information, see: http://zeitkunst.org/projects/bibtex-owl

PRISM

Publishing Requirements for Industry Standard Metadata http://www.idealliance.org/specifications/prism/

PBCORE

Public Broadcasting Metadata Dictionary Project http://pbcore.org

PREMIS

http://loc.gov/premis/

Resource Description and Access (RDA)

This is the most recent set of library cataloging rules, and is supported by an element set that is defined in RDF. RDA is an implementation of the FRBR model. It has about 1400 properties and over 60 term lists. It covers text, sound, film, cartographic materials, and objects, as well as archival materials. http://metadataregistry.org/rdabrowse.htm/

CG: This looks very relevant, especially as it is defined in RDF (didn't know that)

Semantic Publishing and Referencing (SPAR): for citations, including a vocabulary of citation types (CiTO). http://purl.org/spar/fabio/ http://purl.org/spar/cito

Canonical Citations

Key/Encoded-Value Metadata Format for Canonical Citations http://alcme.oclc.org/openurl/servlet/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=info:ofi/fmt:kev:mtx:canonical_cit



SKOS

(W3C standard) kc: again, this is a language, not an implementation.

The Simple Knowledge Organization System (SKOS) provides an RDF model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary. It can be used on its own, or in combination with more formal languages such as the Web Ontology Language (OWL). The aim of SKOS is not to replace original conceptual vocabularies in their initial context of use, but to allow them to be ported to a shared space, based on a simplified model, enabling wider re-use and better interoperability.

SKOS introduces the class skos:Concept, which allows implementers to assert that a given resource is a concept. In basic SKOS, conceptual resources (concepts) are identified with URIs, labeled with strings in one or more natural languages, documented with various types of note, semantically related to each other in informal hierarchies and association networks, and aggregated into concept schemes. More info at http://www.w3.org/TR/skos-primer/
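
A minimal SKOS concept in Turtle (the ex: concept scheme and terms are invented for the example):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/subjects/> .

ex:metadata a skos:Concept ;
    skos:prefLabel "Metadata"@en ;
    skos:altLabel "Meta-data"@en ;
    skos:broader ex:information ;
    skos:inScheme ex:scheme .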

MADS

MADS is another standard for describing subjects, names, and other "authorities". There is an RDF vocabulary for it, and the US Library of Congress now uses it (as well as SKOS) to export authority information. See the description at http://www.loc.gov/standards/mads/

  • XML format for authority data (a derivative of MARC 21 authorities)
  • Descriptions for names, subjects, titles, geographics, genres
  • Uses the same structures as MODS

RDFa

RDFa (W3C recommendation) is RDF embedded in HTML documents. http://www.w3.org/TR/xhtml-rdfa-primer/
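
A sketch of Dublin Core statements embedded in HTML via RDFa (the document and its metadata are invented):

<div xmlns:dc="http://purl.org/dc/elements/1.1/"
     about="http://example.org/doc">
  <span property="dc:title">Open Metadata Handbook</span>,
  by <span property="dc:creator">An Author</span>.
</div>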

Independent from any metadata data model

e.g. a custom format that relies upon a particular markup language (JSON, XML or whatever). Metadata schemas that are not based on a metadata data model are not self-descriptive: the meaning of the markup language is implemented in the logic of the parser. Each such format defines its own specification, with a specific series of tags that can be considered valid - e.g. the Facebook, Twitter, and Google APIs.

PRO:

  • MUCH easier to deal with, and can often achieve an analogous result
  • documents are easy to parse
  • there are no hierarchic dependencies of any sort
  • extremely handy for database insertion and extraction (e.g. Google's BigTable, CouchDB, non-relational/NoSQL databases, etc.)
  • keeping the formats as simple as possible lowers the barrier to compliance

CONS:

  • most of these standards are inherently incompatible with each other
  • metadata cannot be processed unless appropriate documentation is provided
  • the meaning of the markup language is implemented in the logic of the parser: the metadata is not self-descriptive

Drawbacks of Library-specific standards:

  • Lack of standardisation: many library standards, such as MARC or Z39.50, have been or are being developed in a library-specific context. Standardization in libraries is often undertaken by bodies dedicated only to the domain, such as IFLA, or the JSC for the development of RDA.



Dublin Core

Dublin Core has been implemented as a standard that is actually independent of RDF. It can potentially be incorporated into any format, e.g. XML: http://dublincore.org/documents/dc-xml-guidelines/

  • Dublin Core is a stable and well-defined standard.
  • It provides a core of semantically interoperable properties.
  • It is made of a variety of fields which have been specifically and accurately defined.
  • It is a good standard to be imposed as a working rule for a database over which there is full control.
  • There is a problem if it is necessary to deal with data from others that may or may not have all the required elements.
  • One cannot benefit from additional metadata that is outside the scope of Dublin Core:

e.g. a photograph may carry metadata such as the type of camera it was shot on, settings (F-number, zoom level, ISO, ...), location, etc. Even though this is useful metadata, this kind of information is outside the scope of Dublin Core and cannot be accounted for. Any freeform or extensible metadata system (e.g. key-value pairs) will suffice to resolve that drawback. (The advantage of RDF is that it can handle this naturally, and it can also deal with modifications over time.)

Schema.org

Schema.org is an initiative launched on 2 June 2011 by Bing, Google and Yahoo! to introduce the concept of the Semantic Web to websites. On 1 November 2011, Yandex (the largest search engine in Russia) joined the initiative. The operators of the world's largest search engines propose to mark up website content as metadata about itself, using microdata, according to their schemas. Those schemas can be recognized by search engine spiders and other parsers, thus gaining access to the meaning of the sites. The initiative started with a small number of formats, but the long-term goal is to support a wider range of schemas. Schema.org thus provides a collection of schemas (i.e. HTML tags) which can be used for simple bibliographic data, and it is currently being pushed by the major search engine companies (e.g. Google, Bing, Yahoo!).

Many sites are generated from structured data, which is often stored in databases. When this data is formatted into HTML, it becomes very difficult to recover the original structured data. Many applications, especially search engines, can benefit greatly from direct access to this structured data. On-page markup enables search engines to understand the information on web pages and provide richer search results, making it easier for users to find relevant information on the web. Markup can also enable new tools and applications that make use of the structure.

Here's a quick overview of the properties a Schema.org/Book can have (the values in parentheses indicate a type for the property value): Properties from http://schema.org/Thing

  • description
  • image(URL)
  • name
  • url(URL)

Properties from http://schema.org/CreativeWork

  • about(Thing)
  • aggregateRating(AggregateRating)
  • audio(AudioObject)
  • author(Person or Organization)
  • awards
  • contentLocation(Place)
  • contentRating
  • datePublished(Date)
  • editor(Person)
  • encodings(MediaObject)
  • genre
  • headline
  • inLanguage
  • interactionCount
  • isFamilyFriendly(Boolean)
  • keywords
  • offers(Offer)
  • publisher(Organization)
  • reviews(Review)
  • video(VideoObject)

Properties from http://schema.org/Book

  • bookEdition
  • bookFormat(BookFormatType)
  • illustrator(Person)
  • isbn
  • numberOfPages(Integer)

Example: The following is an example of how to embed information about a movie, and the structure of that information, into a website. In order to mark up the data, the attribute itemtype along with the URL of the schema is used. The attribute itemscope defines the scope of the itemtype. The kind of the current item can be defined by using the attribute itemprop. Within the schema for a movie is a schema for a person.

<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
  Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954</span>)
  </div>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

An OPAC that publishes unstructured data produces HTML that looks something like this:

<div> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

The first step is to mark something as the root object. You do that with the itemscope attribute:

<div itemscope> 
<h1>Avatar</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

A microdata-aware search engine looking at this will start building a model.

The second step, using microdata and Schema.org, is to give the object a type. You do that with the itemtype attribute:

<div itemscope itemtype="http://schema.org/Book"> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

Now the object in the model has acquired the type "Book" (or more precisely, the type "http://schema.org/Book").

Next, we give the Book object some properties:

<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

So far, all the property values have been simple text strings. We can also add properties that are links:

<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>

The model grows.

Finally, we want to say that the author, Paul Bryers, is an object in his own right. In fact, we have to, because the value of an author property has to be a Person or an Organization in Schema.org. So we add another itemscope attribute, and give him some properties:

<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <div itemprop="author" itemscope itemtype="http://schema.org.Person">
Author:  <span itemprop="name">Paul Bryers</span> 
(born <span itemprop="birthDate">1945</span>)
 </div>
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>

MARCXML

An international descriptive metadata format; a legacy format in which lots of bibliographic information is still maintained. Components:

  • Markup: data element set
  • Semantics: meaning of elements (but content defined by other standards)
  • Structure: syntax for communication

MARC fields are connected with the International Standard Bibliographic Description (ISBD), developed by the international library community over decades, where elements are marked by punctuation. Although ISBD may look complex, very simple uses are also allowed, such as: Title / Author. - City : Publisher, year.

There are many different MARC versions: national agencies in France, the US, the UK etc. originally developed their own national MARCs, which were then unified in an international UNIMARC. However, in recent years, the US MARC formats have prevailed over UNIMARC due to their adoption in US catalogs, whose data are also imported outside the US. So in practice, library catalogues in different countries will be using different MARC versions.

  • Take advantage of XML: establish standard MARC 21 in an XML structure
  • Allow for interoperability with different schemas through a coordinated set of tools

e.g. the widespread use of bibliographic utilities and ILS implementations based on MARC as a standard communication format with predictable content, and for the sharing of records; cf. open MARC 21 to XML programming tools and presentation style sheets (a sample record follows the list below):

  • Standardize MARC 21 for OAI harvesting
  • Standardize transformations to and from other standard formats (DC, ONIX, …)
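
A sketch of a minimal MARCXML record (the bibliographic data is invented; tag 100 is the main entry personal name, tag 245 the title statement):

<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Bryers, Paul.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Avatar /</subfield>
    <subfield code="c">Paul Bryers.</subfield>
  </datafield>
</record>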

MODS

MODS (Metadata Object Description Schema) is a derivative (subset) of MARC elements intended to create a simpler but compatible alternative: a rich (but not too rich) XML metadata format for emerging initiatives. It can be used:

  • as an extension schema to METS (Metadata Encoding and Transmission Standard)
  • to represent metadata for harvesting (OAI)
  • as an interoperable core for convergence between MARC and non-MARC XML schemas
  • for packaging metadata with a resource (e.g. METS)

MODS is specifically designed for library applications, although it could be used more widely:

  • Uses language-based tags
  • Elements generally inherit the semantics of MARC
  • MODS does not assume the use of any specific cataloging code
  • An "out of the box" schema; can use <extension> for local elements and to bring in external elements from other schemas

MODS is particularly useful for:

  • compatibility with existing bibliographic data
  • embedded descriptions in related items
  • rich, hierarchical descriptions that work well with the METS structural map

RIS

RIS (http://www.refman.com/support/risformat_intro.asp) is probably the most widely supported format for bibliographic references, widely supported by commercial software tools and services. What about open tools and services? Then BibTeX might win. (A sample record follows the list below.)

  • Simple
  • Widely used
  • Proprietary (I think) - it would be interesting to know the exact IP status of the format. Certainly most tools and services are proprietary: RefWorks and friends.
  • Simplistic
  • The specification doesn't always match usage - e.g. the specification lacks a tag for DOI, although DOI is widely used and understood
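
A sketch of a RIS record (the data is invented; each line is a two-character tag followed by "  - ", and ER closes the record):

TY  - BOOK
AU  - Bryers, Paul
TI  - Avatar
PY  - 2011
ER  -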


ONIX

ONIX International from EDItEUR (XML-based book publishers' metadata standard) http://www.editeur.org/12/About-Release-3.0/


BibJSON

http://bibserver.okfn.org/bibjson/ BibJSON is a simple description of how to represent bibliographic metadata in JSON, also based on the BibTeX model. A JSON object is an unordered set of key-value pairs; a BibJSON object is a bibliographic record expressed as a JSON object. BibJSON is just JSON with some agreement on what we expect particular keys to mean. We would like to write parsers from various other formats into BibJSON, to make it easier for people to share bibliographic records and collections. See http://bibserver.okfn.org/roadmap/open-bibliography-for-stm/ http://www.bibkn.org/bibjson/index.html
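
A sketch of what a single BibJSON record might look like, assuming the BibTeX-derived key names used in the BibServer examples (the record itself is invented):

{
  "type": "book",
  "title": "Avatar",
  "author": [ {"name": "Paul Bryers"} ],
  "year": "2011"
}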

Metadata protocols and containers

Protocols

OAI-PMH

(Open Archives Initiative Protocol for Metadata Harvesting) A protocol developed by the Open Archives Initiative. It is used to harvest (or collect) the metadata descriptions of the records in an archive, so that services can be built using metadata from many archives. Especially when dealing with thousands of files being harvested every day, OAI-PMH can help in reducing network traffic and other resource usage by doing incremental harvesting (see the request sketch after the list below). The mod_oai project uses OAI-PMH to expose to web crawlers content that is accessible from Apache web servers.

  • An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations.
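
Harvesting is plain HTTP: the harvester issues GET requests carrying one of the protocol's verbs against the repository's base URL. A sketch against a hypothetical endpoint (Identify, ListRecords and the oai_dc prefix are defined by the protocol; the host is invented):

http://repository.example.org/oai?verb=Identify
http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2011-01-01

The from argument is what makes incremental harvesting possible: only records created or changed since that date are returned.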

The OAI protocol has become widely adopted by many digital libraries, institutional repositories, and digital archives. Although registration is not mandatory, it is encouraged. There are several large registries of OAI-compliant repositories:

  • The Open Archives list of registered OAI repositories
  • The OAI registry at University of Illinois at Urbana-Champaign
  • The Celestial OAI registry
  • Eprints' Institutional Archives Registry
  • Openarchives.eu - The European Guide to OAI-PMH compliant repositories in the world
  • ScientificCommons.org - A worldwide service and registry

Commercial search engines have started using OAI-PMH to acquire more resources:

  • Google accepted OAI-PMH as part of their Sitemap Protocol, though it decided to stop doing so in 2008. Google now uses OAI-PMH to harvest information from the National Library of Australia Digital Object Repository.
  • Yahoo! acquired content from OAIster (University of Michigan) that was obtained through metadata harvesting with OAI-PMH (2004).
  • Wikimedia uses an OAI-PMH repository to provide feeds of Wikipedia (and sister projects) updates for search engines and other bulk analysis/republishing endeavors.
  • NASA's Mercury metadata search system uses OAI-PMH to index thousands of metadata records from the Global Change Master Directory (GCMD) every day.


Atom

  • The Atom Publishing Protocol (AtomPub or APP) is a simple HTTP-based protocol for creating and updating web resources.
  • The Atom Syndication Format is an XML language used for web feeds (a feed contains entries, which may be headlines, full-text articles, excerpts, summaries, and/or links to content on a website, along with various metadata). The Atom format was developed as an alternative to RSS.

Proponents of the new format formed the IETF Atom Publishing Format and Protocol Workgroup. The Atom syndication format was published as an IETF proposed standard in RFC 4287 (December 2005), and the Atom Publishing Protocol was published as RFC 5023 (October 2007). Atom 0.3, released in December 2003, gained widespread adoption in syndication tools, and in particular it was added to several Google-related services, such as Blogger, Google News, and Gmail. Google's Data APIs (GData) are based on Atom 1.0 and RSS 2.0.

All Atom feeds must be well-formed XML documents, and are identified with the application/atom+xml media type. TODO: which formats are favored by Atom?
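
A minimal well-formed Atom feed, for orientation (all values invented; per RFC 4287, title, id and updated are required on both the feed and its entries):

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <updated>2011-12-13T18:30:02Z</updated>
  <author><name>Example Author</name></author>
  <entry>
    <title>New record added</title>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2011-12-13T18:30:02Z</updated>
  </entry>
</feed>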


SPARQL

SPARQL stands for SPARQL Protocol and RDF Query Language. It allows a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns. SPARQL allows users to write globally unambiguous queries. More info available at: http://www.w3.org/TR/rdf-sparql-protocol/ http://en.wikipedia.org/wiki/SPARQL
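
A sketch of a SPARQL query over a store of Dublin Core metadata (the data is assumed, not given by the text): it matches every resource that has a dc:title and returns the first ten pairs.

PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?book ?title
WHERE {
  ?book dc:title ?title .
}
LIMIT 10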

XMPP

Extensible Messaging and Presence Protocol (XMPP) is an open-standard communications protocol for message-oriented middleware based on XML (Extensible Markup Language). The protocol was originally named Jabber, and was developed by the Jabber open-source community in 1999 for near-real-time, extensible instant messaging (IM), presence information, and contact list maintenance. Designed to be extensible, the protocol today also finds application in VoIP and file transfer signaling. With XMPP it would be possible to catch event streams while cataloging in near real time. More info available at: http://en.wikipedia.org/wiki/Extensible_Messaging_and_Presence_Protocol


Z39.50

The most widely deployed and currently active (in production systems) method of interoperably searching remote library catalogs. It is currently supported by all national libraries, most academic libraries and many public/private collections.

Z39.50 is a BER-encoded, ASN.1-defined, stateful, session-based protocol for information retrieval. Although its primary function is access to a single remote target, the protocol forms the backbone of many contemporary broadcast and meta-search systems (virtual union catalogs) where live search is needed, although HTTP-based alternatives exist, such as the MetaOPAC Azalai Italiano (MAI). This can be compared to physical union catalogs, where all data is harvested into a single repository. The obvious advantages of the virtual union catalog are the real-time updating of holdings and availability information, and the delegation of security evaluation to leaf nodes in the network (i.e. where security cannot be delegated to a single harvest node).

The protocol itself does not mandate a record syntax (MARC, XML, GRS, etc.), instead only specifying the semantics of the retrieval operation. Different record syntaxes can be used to convey different semantics about bibliographic items. For example, national MARC variants are the commonly used payload for bibliographic information, while other syntaxes such as the GRS-1 encoded opac-1 format can be used to query real-time availability and holdings information. Clients are free to request multiple encodings of the same record. Taking advantage of this capability, Z39.50 can also be used as a data source for library reservations and interlending subsystems, although those features are more commonly supported by newer circulation protocols (which often suffer from less consensus amongst vendors at the interoperability level). Z39.50 also features an extended service facility to provide services such as item order and record upload.

Z39.50 should not be confused with an indexing system such as Apache Solr. Z39.50 specifies a standard interface which is used as an openly defined access layer to a retrieval index. There are currently at least two Z39.50 <-> Solr bridges in existence.

Additional Info

Index Data maintains a useful meta-index of the publicly available Z39.50 targets and their capabilities at http://irspy.indexdata.com/.

http://en.wikipedia.org/wiki/Z39.50

Perhaps the most valuable part of Z39.50 lies in its rich heritage of interoperability and cooperation amongst vendors. The Z39.50 implementers group consisted of representatives from libraries and software vendors, and sought to avoid many of the issues present in creating interoperable bibliographic systems.

A major advantage of the protocol however is the way it insulates a retrieval endpoint from changes in indexing technology and record payload.

Z39.50 is also used to serve up subject thesauri and other controlled vocabulary lists.

Being payload agnostic, Z39.50 has been deployed in a number of different scenarios over the years, from providing searchable access to US government information (GILS), cultural datasets (PADS, the Performing Arts Data Service), and archives (ArchivesHub), to the US Geological Survey and its spatial data clearinghouse (USGeo). Z39.50 lays out a framework which allows the interoperable cross-searching of all these diverse information types by defining abstract search access points (use attributes).

One major criticism of Z39.50 is the lack of standardised identifier-based access to items on the server. Item-level access is through ordinal position in a result set; there is no direct access to items by unique ID. This means most item-level access has to be profiled as a search for a particular unique ID, followed by a retrieval operation on that result. Whilst in practice this isn't an issue, it can make working with the protocol feel unwieldy at first.

Modern cross-search systems based on Z39.50 are often criticised for not providing a good user experience. However, many of the problems highlighted are inherent in cross-searching and aren't specifically Z39.50 issues. There are, however, a number of poorly behaved Z39.50 targets in existence, and the developer community has built up a large array of work-arounds and knowledge about the landscape of Z39.50 targets.

SRU/SRW

SRU emerged out of discussions in the Z39.50 implementers group, who recognised the need for a REST-like replacement for / alternative to the original BER-encoded protocol. Many of the same application structures can be found in SRU (REST-like URL-based retrieval) and SRW (SOAP) as in the source Z39.50 protocol.

Additional Info

http://www.loc.gov/standards/sru/



Containers

The METS schema is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium. The standard is maintained by the Network Development and MARC Standards Office of the Library of Congress, and is being developed as an initiative of the Digital Library Federation. METS attempts to build upon the work of MOA2 and provide an XML document format for encoding the metadata necessary for both the management of digital library objects within a repository and the exchange of such objects between repositories (or between repositories and their users). Depending on its use, a METS document could play the role of a Submission Information Package (SIP), Archival Information Package (AIP), or Dissemination Information Package (DIP) within the Open Archival Information System (OAIS) Reference Model. Containers such as METS are one way of addressing the problem of combining descriptive and non-descriptive metadata. A METS document consists of seven major sections:

  • METS Header - The METS Header contains metadata describing the METS document itself, including such information as creator, editor, etc.
  • Descriptive Metadata - The descriptive metadata section may point to descriptive metadata external to the METS document (e.g., a MARC record in an OPAC or an EAD finding aid maintained on a WWW server), or contain internally embedded descriptive metadata, or both. Multiple instances of both external and internal descriptive metadata may be included in the descriptive metadata section.
  • Administrative Metadata - The administrative metadata section provides information regarding how the files were created and stored, intellectual property rights, metadata regarding the original source object from which the digital library object derives, and information regarding the provenance of the files comprising the digital library object (i.e., master/derivative file relationships, and migration/transformation information). As with descriptive metadata, administrative metadata may be either external to the METS document, or encoded internally.
  • File Section - The file section lists all files containing content which comprise the electronic versions of the digital object. <file> elements may be grouped within <fileGrp> elements, to provide for subdividing the files by object version.
  • Structural Map - The structural map is the heart of a METS document. It outlines a hierarchical structure for the digital library object, and links the elements of that structure to content files and metadata that pertain to each element.
  • Structural Links - The Structural Links section of METS allows METS creators to record the existence of hyperlinks between nodes in the hierarchy outlined in the Structural Map. This is of particular value in using METS to archive Websites.
  • Behavior - A behavior section can be used to associate executable behaviors with content in the METS object. Each behavior within a behavior section has an interface definition element that represents an abstract definition of the set of behaviors represented by a particular behavior section. Each behavior also has a mechanism element which identifies a module of executable code that implements and runs the behaviors defined abstractly by the interface definition.

A more detailed explanation of each section and their inter-relations can be found at http://www.loc.gov/standards/mets/METSOverview.v2.html

OAI-ORE

Open Archives Initiative Object Reuse and Exchange (OAI-ORE) defines standards for the description and exchange of aggregations of Web resources. These aggregations, sometimes called compound digital objects, may combine distributed resources with multiple media types including text, images, data, and video. On the Web that we use on a daily basis, URIs are used primarily to identify Web documents: they are identifiers that, when dereferenced, return a human-readable representation. On the Semantic Web, however, URIs are introduced to identify so-called real-world entities, such as people or cars, or even abstract entities, such as ideas or classes. Since these things are not documents, they have no representation to indicate what these resources mean. The Linked Data effort [Linked Data Tutorial: http://www.openarchives.org/ore/1.0/primer.html#ref-linked-data] describes an approach for obtaining information about those resources despite the fact that they have no representation.

ORE is based on 4 key notions (classes):

  • Object: the book/painting/program being described
  • Aggregation: organizes object information from a given provider (museum, archive, library); a resource map expresses which Aggregation it describes (the ore:describes relationship) and lists the resources that are part of the Aggregation (the ore:aggregates relationship)
  • Digital representation: some digital form of the object with a Web address
  • Proxy: the metadata record for the object

ORE supports Resource Map serializations in RDF/XML, RDFa, and Atom XML. More info available at: http://www.openarchives.org/ore/1.0/primer.html http://www.openarchives.org/ore/1.0/toc.html
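
A sketch of a resource map in Turtle, using the ore:describes and ore:aggregates relationships named above (all URIs invented):

@prefix ore: <http://www.openarchives.org/ore/terms/> .

<http://example.org/rem/1> a ore:ResourceMap ;
    ore:describes <http://example.org/aggregation/1> .

<http://example.org/aggregation/1> a ore:Aggregation ;
    ore:aggregates <http://example.org/page/1> ,
                   <http://example.org/page/2> .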

WHO USES WHAT

See http://ckan.net/group/lld (list of library datasets)

The level of maturity or stability of currently available metadata schemas varies greatly. Many are the result of ongoing project work, or the result of individual initiatives, and describe themselves as prototypes rather than mature standards. More and more established institutions are committing resources to linked data projects, from the national libraries of Sweden, Hungary, Germany, France, the Library of Congress and the British Library, to the Food and Agriculture Organization of the United Nations, not to mention OCLC. These institutions can provide a stable base on which library linked data will build over time.

Every major library in the UK/USA will use MARC 21, as will many European libraries. In Germany, MAB2 and Pica are widely used. These will be used for record creation, data exchange and internal storage.

British Library Data Model

http://www.bl.uk/bibliographic/pdfs/british_library_data_model_v1-00.pdf http://www.bl.uk/bibliographic/pdfs/britishlibrarytermsv1-00.pdf

@prefix xxx

declares a namespace prefix for an ontology from which Classes and Properties can be drawn

a owl:Ontology;

a => rdf:type - 'a' is a verb (property) that is defined within the RDF vocabulary
owl:Ontology - the object is defined in the ontology mapped to the prefix 'owl'

dct:created "2010-06-28"^^xsd:date;

dct:created is defined in Dublin Core Terms
xsd:date is defined in XML Schema

blt:PublicationEvent a rdfs:Class , owl:Class;

define a new object of rdf:type Class (according to rdfs & owl)

rdfs:label "Publication event"@en ;

define its label according to the rdfs definition of 'label'

rdfs:comment "An event which is the publication of a resource."@en ;

define a comment

rdfs:subClassOf event:Event ;

is a subclass of an event (according to 'event' definition)

rdfs:isDefinedBy blt: .

is defined by blt: (the British Library Terms vocabulary itself)

Creative Commons

Creative Commons metadata files have two major parts: a work description and a license description. The work description uses Dublin Core properties to provide information about the work. For more information, see: http://creativecommons.org/technology/metadata/schema.rdf http://creativecommons.org/learn/technology/metadata/


Europeana Data Model (EDM)

http://pro.europeana.eu/edm-documentation

The goal is to:

  1. preserve original metadata - expressed as closely as possible to the original model
  2. while allowing for interoperability - using mappings to a more interoperable level

Requirements:

(1) distinction between:

  1. the 'object' (painting, book, software)
  2. the digital representation
  3. the metadata describing that object (and there can be more than one record)

(2) support for objects that are made of several objects. The problem is that there is no standard way to describe the constituents or boundary of an aggregation; this is what OAI-ORE aims to provide: ==> Open Archives Initiative Object Reuse and Exchange (OAI-ORE)

(3) based on an existing standard metadata format and standard vocabulary format: ==> Dublin Core for metadata representation (EDM uses DCMI Metadata Terms specified with an RDF model); ==> SKOS for vocabulary representation (EDM uses SKOS specified with an RDF model)

Library of Congress

SKOS, MADS

  • Digital library projects (Library of Congress)

AV-Prototype: digital preservation for audio and video; uses METS and MODS, with a focus on metadata; cataloging report to be used as an intermediate level of description


UNESCO's CDS/ISIS library software

Common Communications Format (CCF)


University of California Press

Using METS with MODS for freely available ebooks


MusicAustralia

MODS as an exchange format between the National Library of Australia and ScreenSound Australia. Allows for consistency with MARC data.


Bibliothèque Nationale de France (BnF)

Contact: Romain Wenz, responsible for data.bnf.fr at the Département de l'Information Bibliographique et Numérique of the BnF. It currently only deals with literary/visual resources, but will expand the catalogue to musical works soon. Different catalogues use different standards (MARC, DC, ...), hence a lack of internal interoperability. data.bnf.fr uses RDF with different ontologies:

  • SKOS: for concepts
  • FOAF: for persons
  • DC/RDA: for resources

BnF provides public RDF dumps for every online resource --> /rdf.xml

Gallica

Centre Pompidou Virtuel

RDF


Archives de France

Contact: Claire Sibille, head of the bureau for archival processing and computerisation at the Service interministériel des Archives de France of the Ministère de la Culture et de la Communication. Thesaurus W, for the indexing of local archives, is published by the Archives de France.

  • EAD (Encoded Archive Description)
  • EAC-CPF (Encoded Archive Context - Collectivities, Persons, Families)

History: 1. XML; 2. Excel sheets; 3. XML/SKOS (with ThManager). Today:

  • URI identification for each term + relationship between terms defined by SKOS
  • relationships between these terms defined by RDF triplets
  • the thesaurus has been aligned with RAMEAU & DBpedia

Consultation can be made in HTML or RDF/XML; the whole database can be downloaded in RDF; consultation is also possible via SPARQL queries and via a web API to the thesaurus.

  • URIs can be dereferenced in different manners according to the context


Biblioteca Nazionale Centrale di Firenze

Maintains the national bibliography of Italian books and develops the Nuovo Soggettario, a national general thesaurus, also available in SKOS under a Creative Commons 2.5 licence. It declares itself to be "defining ways of online publication as Linked Data of produced metadata", at a "first prototypical experimental stage" (contact: Giovanni Bergamin): http://thes.bncf.firenze.sbn.it/thes-dati.htm


SNAC: EAC-CPF

LOCAH: EAC-CPF

Archives Hub and COPAC with Linked Data; creation of links with other databases (e.g. BBC, OCLC, LCSH).




There exist many models to describe metadata: certain models use specific tools (e.g. BibTeX), others use ad-hoc formats (e.g. XML and the like, JSON APIs, etc.).

This map from Jenn Riley at Indiana is probably a pretty good starting point for the different metadata standards: http://www.dlib.indiana.edu/~jenlrile/metadatamap/seeingstandards.pdf

W3C LLD reports

Mailing list: public-lld@w3.org

Contacts: Emmanuelle Bermès, chair of the W3C "Library Linked Data" Incubator Group.