NAME

Thesaurus::Overview -- An overview of the Thesaurus mechanism.


DESCRIPTION

This document specifies a standard for distributed thesaurus communication using HTTP as the access protocol and RDF as the document format.

The basic premises for this distributed thesaurus interaction are:

Each Thesaurus is a unique URL

e.g.

 http://ceres.ca.gov/cgi-bin/thesauri/CERES

This allows an unambiguous definition of the distributed thesaurus. It also allows for an interested user to know where to go to get more information regarding this thesaurus.

The thesaurus URL is used to deliver information about the thesaurus to the client, to accept queries to the thesaurus, and to return information about specific terms in the thesaurus.

Each Thesaurus Term is a unique URL

The format of the URL is:

  thesaurus_url?term_identifier

e.g.

 http://ceres.ca.gov/cgi-bin/thesauri/CERES?Ecosystems

This allows the user to know exactly which thesaurus is being used, and where to go for more information about the thesaurus and the specific term.
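The URL format above can be sketched in a few lines of Python. This is only an illustration: the function names `term_url` and `split_term_url` are hypothetical, and percent-encoding the term identifier is an assumption, since this document does not say how identifiers with spaces or reserved characters are written.

```python
from urllib.parse import quote, unquote, urlsplit


def term_url(thesaurus_url, term_identifier):
    """Build a term URL of the form thesaurus_url?term_identifier."""
    # Assumption: identifiers are percent-encoded so that spaces and
    # reserved characters survive in the query string.
    return f"{thesaurus_url}?{quote(term_identifier, safe='')}"


def split_term_url(url):
    """Split a term URL back into (thesaurus_url, term_identifier)."""
    parts = urlsplit(url)
    thesaurus_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return thesaurus_url, unquote(parts.query)


url = term_url("http://ceres.ca.gov/cgi-bin/thesauri/CERES", "Ecosystems")
print(url)
print(split_term_url(url))
```

Because the thesaurus URL is a prefix of every term URL, a client holding only a term URL can always recover the thesaurus it belongs to.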

Clients query servers through HTTP requests

The client queries the server by sending HTTP requests to the thesaurus server. See Thesaurus for a description of the HTTP query commands.

Servers respond to clients with RDF responses

The server responds with an HTML document that contains a section of RDF, which the client can parse to retrieve information about terms in the thesaurus. See Thesaurus for a description of the RDF format of the responses to these queries. The server should also return some HTML that the client may use.
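As a rough sketch of the client side, the RDF section can be pulled out of the HTML response and parsed separately. Everything specific here is an assumption: the sample response is invented, and the code assumes the RDF is embedded as a single `rdf:RDF` element that uses the `rdf` prefix and declares its own namespace. The actual RDF vocabulary is described in Thesaurus, not here.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def extract_rdf(html):
    """Return the parsed rdf:RDF element found in an HTML page, or None."""
    # Assumption: the RDF is one <rdf:RDF>...</rdf:RDF> block that
    # carries its own xmlns:rdf declaration.
    match = re.search(r"<rdf:RDF\b.*?</rdf:RDF>", html, re.DOTALL)
    if match is None:
        return None
    return ET.fromstring(match.group(0))


# Hypothetical server response, for illustration only.
SAMPLE = """<html><body>
<h1>Ecosystems</h1>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://ceres.ca.gov/cgi-bin/thesauri/CERES?Ecosystems"/>
</rdf:RDF>
</body></html>"""

root = extract_rdf(SAMPLE)
abouts = [d.get(f"{{{RDF_NS}}}about")
          for d in root.iter(f"{{{RDF_NS}}}Description")]
print(abouts)
```

Extracting the RDF block first lets the client ignore the surrounding HTML, which need not be well-formed XML.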


Overview

There are a number of standard vocabularies whose relationships are consistent with the Z39.19 standard for thesauri. Some, like the database of thematic keywords being developed at CERES, are specially designed to this standard. Some, like the ITIS database, are designed for other purposes but still conform to all or part of this standard.

The purpose of defining a standard is to allow a thesaurus client to interact with a number of available online thesauri in a consistent manner. This allows applications that depend on controlled vocabularies to access multiple thesauri without designing an interface to each one.

The communication between the clients and the thesaurus server is simple enough that a protocol based on HTTP and RDF/XML is sufficient. This standard defines several URLs that act as queries to the server. The HTTPD process on the server receives these queries, translates them into some internal, unspecified action that performs the query, formats the result into an RDF/XML document, and passes that back to the client. The client application parses the RDF/XML for its own purposes.

       -------------------             ---------------
      | Thesaurus Browser |           | Other Clients |
       -------------------             ---------------
                |                             |
                |     URLS >>                 |
                 --<< RDF/XML ----------------
                               ||
                               ||
                         -------------
                        |    HTTPD    |
                        |   Server    |
                         -------------
                               |
              ------------------------------------
             |                 |                  |
       -------------    --------------   --------------------
      | ITIS Driver |  | mysql Driver | | Additional Drivers |
       -------------    --------------   --------------------
             |                |
       -------------    --------------
      |   ITIS DB   |  |   mysql DB   |
       -------------    --------------


Terminology

All terminology, unless specifically noted in this section, carries the same meaning as described in NISO Z39.19. Readers are encouraged to consult that document for a more complete glossary of terms.

Category

A grouping of descriptors that are semantically or statistically associated, but which do not constitute a strict hierarchy based on genus-species or part-whole relationships.

Descriptor

A type of heading that is a term chosen as the preferred expression of a concept in the thesaurus.

EntryTerm

The non-preferred term in a cross reference that leads to a descriptor in a thesaurus.

Term

One or more words that designate a concept within a thesaurus.
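To make these definitions concrete, the sketch below models a descriptor together with its entry terms and categories, and resolves any term to its preferred descriptor. The `Descriptor` class, the `resolve` function, and the sample entry term and category are all hypothetical; this is one possible in-memory shape, not the data model of this specification.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """A preferred term, with the entry terms that lead to it."""
    name: str
    entry_terms: list = field(default_factory=list)  # non-preferred terms
    categories: list = field(default_factory=list)   # loose groupings


def resolve(term, descriptors):
    """Return the preferred descriptor name for a term, or None."""
    for descriptor in descriptors:
        if term == descriptor.name or term in descriptor.entry_terms:
            return descriptor.name
    return None


# Hypothetical sample data, for illustration only.
vocabulary = [
    Descriptor("Ecosystems",
               entry_terms=["Ecological systems"],
               categories=["Natural environment"]),
]

print(resolve("Ecological systems", vocabulary))
print(resolve("Ecosystems", vocabulary))
```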


Implementation Design Goals

1. This specification shall reasonably describe a thesaurus conforming to the NISO Z39.19 Guidelines for Monolingual Thesauri.

2. This specification shall be straightforwardly usable over the Internet.

3. This specification will be created in a timely manner. This specification is oriented towards a beta implementation, and newer versions of this specification will not necessarily be backwards compatible with this one.

4. This specification is intended as a transfer mechanism for networked thesauri. It is not intended as a specification for user interfaces to those thesauri; however, the specification should be written such that its realization in a user interface is not complex. For example, an application may retrieve an RDF specification for a thesaurus and insert XLink structures to allow user navigation, without significant modification of the original RDF structure.


Relationship to Existing Standards:

This specification defines a thesaurus in accordance with the NISO Z39.19 Guidelines for Monolingual Thesauri. The standards and guidelines in that document are required of any thesaurus used with this standard.


Examples

Example of Thesaurus Information

http://ceres.ca.gov/cgi-bin/thesauri/CERES

This URL should respond with an HTML document that gives some description of the anticipated use of this thesaurus, as well as the capabilities of the thesaurus.

Example of a term retrieval

http://ceres.ca.gov/cgi-bin/thesauri/CERES?Ecosystems

This returns information about a term in that thesaurus. If you are viewing this with an HTML browser, view the source of the page and note that, besides HTML, the server is also returning RDF and JavaScript.

Example of a term search

http://ceres.ca.gov/cgi-bin/thesauri/CERES?term=Ecosystems

This returns a match for the term 'Ecosystems' based on the default matching strategy. Again, HTML, RDF, and JavaScript are all returned.

Another search example

http://ceres.ca.gov/cgi-bin/thesauri/CERES?match=sql&term=Eco%25&type=Descriptor

This matches only Descriptors, using the SQL matching strategy against the string 'Eco%'. Note that the '%' wildcard is percent-encoded as '%25' in the URL.
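The search URLs in the examples above can be assembled with standard URL encoding, which also takes care of escaping the SQL wildcard. In this sketch the function name `search_url` and its argument names are hypothetical; only the `match`, `term`, and `type` query parameters come from the examples.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(thesaurus_url, term, match=None, term_type=None):
    """Build a search URL; optional parameters are omitted when None."""
    params = []
    if match is not None:
        params.append(("match", match))
    params.append(("term", term))
    if term_type is not None:
        params.append(("type", term_type))
    # urlencode percent-encodes reserved characters, so '%' becomes '%25'.
    return f"{thesaurus_url}?{urlencode(params)}"


base = "http://ceres.ca.gov/cgi-bin/thesauri/CERES"
simple = search_url(base, "Ecosystems")
sql = search_url(base, "Eco%", match="sql", term_type="Descriptor")
print(simple)
print(sql)

# A server would decode the query string back into its parameters.
print(parse_qs(urlsplit(sql).query))
```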

HTML document as a wordlist

http://ceres.ca.gov/thesaurus/ceres_projects.html

This example shows how an HTML document can be used as a surrogate for a complete thesaurus. The idea here is to allow the same client to use smallish wordlists that are important to a small community. Examination of the source of this document shows that some JavaScript has been included in the file; this is what the client reads to make the HTML document behave as a wordlist.

Complete Example

http://ceres.ca.gov/javascript/thesaurus_test.html

This is the most complete example of a thesaurus client/server interaction. This URL points to a simple HTML form. The form has buttons that allow the user to pick terms from a number of thesauri. The client fills in the form for the calling HTML document. The buttons launch helper applications that make calls similar to those above, but parse the results and then compose new pages on the fly. This allows these applications to interact directly with other HTML forms, with little special coding.

NOTE: This is a prototype, and does not use the RDF syntax. Instead, the client uses the returned JavaScript code. This is not a language-independent solution, and the prototype will be modified.


Unresolved Issues

Richer descriptive text

We need to supply some simple tagging elements for some of the text, so that emphasis and similar formatting can be expressed.

Partial returns

One of the most difficult parts of this specification will be to convey the notion of partial returns. For example, if the client application requests all the Terms, and the Service Application responds with only a subsection, then we need to give the client some notion of Next, etc., to complete the transaction.

USE+

There is no notion of the USE+ relationship yet integrated into this specification.

Thesaurus Server information

There needs to be more negotiation between the client and the server about what capabilities the server has. This should be a part of the information section that is returned by the server.

Multiple Hosts for a single Thesaurus

There needs to be a mechanism by which multiple hosts can be used to serve the same thesaurus. I'm in favor of using the Perl CPAN idea of maintaining a single point of entry that allows the client to be redirected to a new location. This isn't completely thought out, however.