
Thesaurus/Glossary Approach for the Federal Government

by Ken Sall

Download Presentations:
Version 4: Flexible XML-Based Thesaurus Approach for the Federal Government [PPT] - Apr. 20, 2005 [See newer demos.]
Version 3: Flexible XML-Based Thesaurus Approach for the Federal Government - Highlights [PPT] - Mar. 17, 2005 [See newer demos.]
Version 2: Flexible XML-Based Glossary Approach for the Federal Government - Next Generation [PPT] - Feb. 25, 2005
Candidate Glossary Requirements.doc - Feb. 25, 2005
Version 1: Flexible XML-Based Glossary Approach for the Federal Government [PPT] - Jan. 19, 2005 [See older demos.]

This page illustrates an approach for creating XML-based thesauri/glossaries in the Federal Government. The design goals and advantages of the approach are described in more detail in the PowerPoint slides.

Newer Glossary Demos: See PowerPoint Version 3 and Later [added since 3/15/05]
File Description
Thesaurus XSD Diagram The candidate XML Schema pictured here is a subset of SKOS, an emerging RDF vocabulary from the W3C which is also based on ISO 2788 (Guidelines for the establishment and development of monolingual thesauri). According to the W3C, "SKOS Core provides a model for expressing the basic structure and content of concept schemes (thesauri, classification schemes, subject heading lists, taxonomies, terminologies, glossaries and other types of controlled vocabulary)." SKOS also complements OWL [Web Ontology Language], according to the SKOS Core Guide: "SKOS Core is intended to provide both a stable encoding of thesaurus-like data within the RDF graph formalism, as well as a migration path for exploring the costs and benefits of moving from thesaurus-like to RDFS/OWL-like modelling formalisms."
NOTE: All elements below the element named Related are considered ancillary elements which are still under development. See the description of the main portion of the XSD (part 1) and part 2.
GAO Thesaurus: Excel Implementation This screenshot shows an Excel spreadsheet whose columns adhere exactly to the first 11 elements of the XSD. By filling out the spreadsheet with one row for each term (Concept), it is relatively easy to have another person or potentially a server-side process convert the spreadsheet to valid XML without cutting and pasting.
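The spreadsheet-to-XML conversion described above can be sketched roughly as follows, assuming the spreadsheet has first been exported to CSV. The column names (borrowed from SKOS), the Thesaurus root element, and the sample rows are illustrative assumptions, not the actual GAO spreadsheet layout or the converter that was used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column names; the real spreadsheet follows the first 11 XSD elements.
COLUMNS = ["prefLabel", "altLabel", "definition", "scopeNote",
           "broader", "narrower", "related"]

# Made-up sample data standing in for a CSV export of the spreadsheet.
SAMPLE = """prefLabel,altLabel,definition,scopeNote,broader,narrower,related
Auditing,,Examination of records,,Accountability,,
Budgeting,,,,,,
"""

def rows_to_xml(csv_text: str) -> ET.Element:
    """Build a root element (name assumed) holding one Concept per CSV row."""
    root = ET.Element("Thesaurus")
    for row in csv.DictReader(io.StringIO(csv_text)):
        concept = ET.SubElement(root, "Concept")
        for name in COLUMNS:
            value = (row.get(name) or "").strip()
            if value:  # a blank cell produces no element at all
                ET.SubElement(concept, name).text = value
    return root

if __name__ == "__main__":
    print(ET.tostring(rows_to_xml(SAMPLE), encoding="unicode"))
```

Skipping blank cells, rather than emitting empty elements, is what lets optional columns stay optional in the resulting XML. The output would still need to be validated against the XSD as a separate step.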
GAO Thesaurus: XML Conversion from Excel This is the XML that results from the conversion of the GAO Thesaurus spreadsheet. It is validated against the XSD. If your browser will not display it, try this screenshot instead.
DRM Glossary (Mixed Example): XML Conversion from Excel with Tabular XSLT This example, also converted from a spreadsheet to valid XML, applies a simple XSLT stylesheet to render each concept in a separate HTML table. Notice that for each Concept, only those properties (elements) which have values are displayed. In other words, if a column is left blank in the spreadsheet, both the conversion process and the XSLT rendering handle the omission of optional values properly. A small amount of special-case code is used to create a link source and target for the portion of the results which represents the DRM Glossary from the Data Reference Model, Volume 1, Version 1, September 2004. Use your browser to View Source of the XML.
Federal Register Thesaurus as XML data with XSLT This example illustrates the benefit of using XML to represent the data from the Federal Register Thesaurus Of Indexing Terms (from 1995). If you View Source for the link in the left column, you will note that it is fairly flat XML, with each term represented as a Concept with 4 child elements (prefLabel, broader, scopeNote, and SOURCE). An XSLT stylesheet is used to process the XML data in 2 different ways: first, sorting all 729 concepts alphabetically (which differs from document order), and second, sorting by category (closer to document order). As an added feature, in the second sort, each concept is linked to a Google Uncle Sam search for that term (restricted to .gov or .mil sites). For example, if you click on the link for the word fish, the first hit is the US Fish and Wildlife Service home page; clicking on the word agriculture returns the USDA home page as the first hit. Note that other XSLT stylesheets could readily be produced to process the same Federal Register Thesaurus XML data differently. Use your browser to View Source of the XML.
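The two sort orders and the per-term search link described above can be sketched in Python as follows. This is not the XSLT stylesheet itself. It assumes that the broader element carries the category, that the root element is named Thesaurus, and that a plain Google query restricted with site:.gov OR site:.mil is an acceptable stand-in for the Google Uncle Sam search; the sample concepts are made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Made-up sample; the real file holds 729 concepts.
SAMPLE = """<Thesaurus>
  <Concept><prefLabel>Fish</prefLabel><broader>Wildlife</broader></Concept>
  <Concept><prefLabel>Agriculture</prefLabel><broader>Farming</broader></Concept>
  <Concept><prefLabel>Dairy products</prefLabel><broader>Farming</broader></Concept>
</Thesaurus>"""

def labels_alphabetical(root):
    """First view: every preferred label, sorted case-insensitively."""
    labels = (c.findtext("prefLabel", "") for c in root.iter("Concept"))
    return sorted(labels, key=str.casefold)

def labels_by_category(root):
    """Second view: (category, label) pairs sorted by category, then label."""
    pairs = [(c.findtext("broader", ""), c.findtext("prefLabel", ""))
             for c in root.iter("Concept")]
    return sorted(pairs, key=lambda p: (p[0].casefold(), p[1].casefold()))

def gov_search_url(term):
    """Hypothetical link: a Google query limited to .gov and .mil sites."""
    query = f"{term} site:.gov OR site:.mil"
    return "https://www.google.com/search?" + urlencode({"q": query})

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(labels_alphabetical(root))
    for category, label in labels_by_category(root):
        print(category, label, gov_search_url(label))
```

Because both views are computed from the same parsed data, adding a third presentation would only require another small function, which mirrors the point about writing additional stylesheets against the same XML.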
Older Glossary Demos: See PowerPoint Version 1
File Description
Glossary DTD DTD which defines a strawman Glossary model as a set of one or more Term elements, each of which has a required Name, and zero or more DefinitionSections. A number of optional and repeatable elements comprise each DefinitionSection. The comments in the DTD are helpful to understand the source of the terminology. Elements with the string "TBD" following them are not actually used in the DTD, although they are ISO 1087 terminology.
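The content model described above (one or more Term elements, each with a required Name and any number of DefinitionSections) can be approximated with a small structural check. This is only a sketch: it is not a DTD validator, the Glossary root element name is an assumption, and it reads "required Name" as "exactly one Name".

```python
import xml.etree.ElementTree as ET

def check_glossary(root):
    """Return a list of problems; an empty list means the structure looks fine."""
    problems = []
    terms = root.findall("Term")
    if not terms:
        problems.append("glossary has no Term elements")
    for i, term in enumerate(terms, start=1):
        # DefinitionSection is optional and repeatable, so only Name is checked.
        if len(term.findall("Name")) != 1:
            problems.append(f"Term #{i} does not have exactly one Name")
    return problems

if __name__ == "__main__":
    doc = ET.fromstring(
        "<Glossary>"
        "<Term><Name>XML</Name></Term>"
        "<Term><DefinitionSection/></Term>"
        "</Glossary>"
    )
    print(check_glossary(doc))
```

A validating parser run against the actual DTD remains the authoritative check; the function above only illustrates the required-versus-optional rules.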
Glossary XSD XML Schema generated from the DTD using XML Spy. Lacks comments. The XSD shows elements such as PreferredTerm, Nomenclature, and Designation which are defined but not used in the DTD; such terms are from ISO 1087.
Glossary PNG Visual depiction of the Glossary XSD. This reflects only the elements that are actually used. Note which ones are optional and/or repeatable.
Glossary XSLT XSLT stylesheet that first sorts all of the Term elements and then formats each one. The stylesheet also generates 3 URLs on the fly for each Term, resulting in 3 links per Term (not stored in the XML document) for Google and WordNet searching. This technique is easily extensible. Although the XSLT is patterned after the DTD, it would be relatively easy to modify the former as the latter changes.
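The idea of generating search links from a Term's name, rather than storing them in the XML, can be sketched as follows. The three link types and their URL patterns are assumptions chosen for illustration; the stylesheet's actual links may differ.

```python
from urllib.parse import urlencode

def search_links(term):
    """Build three search URLs for a term name (link choices are hypothetical)."""
    google = "https://www.google.com/search?"
    return {
        "Google": google + urlencode({"q": term}),
        "Google define": google + urlencode({"q": f"define:{term}"}),
        "WordNet": "http://wordnetweb.princeton.edu/perl/webwn?"
                   + urlencode({"s": term}),
    }

if __name__ == "__main__":
    for label, url in search_links("data model").items():
        print(label, url)
```

Adding another search target means adding one more dictionary entry, with no change to the glossary data.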
Glossary Example #1 (XML instance) This XML instance represents a fragment of a glossary based on the strawman DTD. The terms were selected to illustrate the flexibility of the DTD (optionality and repeatability). Only a Term's Name element is absolutely required. This provides a way to simply insert terms to be defined by someone else. Because the instance document refers to the XSLT (as well as the DTD), modern browsers will apply the stylesheet to the instance, resulting in an XHTML rendering in the browser. View Source to see the actual XML markup.
Glossary Example #2 (XML instance) Another XML instance represents a second fragment of a glossary based on the strawman DTD. This can be merged with other instances to form a more complete glossary, as a collaborative development effort.
Glossary Merge XSLT XSLT stylesheet that merges 2 or more glossary XML instances to create a sorted, combined glossary. [TBD]
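Since the merge stylesheet is still marked TBD, the intended behavior can only be sketched. The Python version below assumes a Glossary root element, keeps every Term from every input (duplicates are not reconciled), and sorts case-insensitively by Name; all of these are assumptions rather than a description of the eventual XSLT.

```python
import xml.etree.ElementTree as ET

def merge_glossaries(*xml_texts):
    """Combine the Term elements of several glossary documents, sorted by Name."""
    terms = []
    for text in xml_texts:
        terms.extend(ET.fromstring(text).findall("Term"))
    terms.sort(key=lambda t: t.findtext("Name", "").casefold())
    merged = ET.Element("Glossary")  # root element name is assumed
    merged.extend(terms)
    return merged

if __name__ == "__main__":
    first = "<Glossary><Term><Name>Taxonomy</Name></Term></Glossary>"
    second = "<Glossary><Term><Name>Ontology</Name></Term></Glossary>"
    print(ET.tostring(merge_glossaries(first, second), encoding="unicode"))
```

A real merge would also need a policy for terms that appear in more than one instance, for example combining their DefinitionSections.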
Result of Merge (XML instance) This is the result of applying the Merge XSLT to the 2 instance documents. [TBD]
DOI FBMS Acronyms as XML The Department of the Interior is implementing the Financial and Business Management System (FBMS). FBMS will provide Interior with standard business practices supported by a single, integrated finance and administrative system for all Bureaus. The FBMS Acronyms have been converted to the strawman glossary format, chiefly to benefit from the generated search links.
DMWG Glossary v0.2.xml DMWG Glossary v0.2.doc converted to CSV and then to XML to match our DTD. Some hand edits were required due to invalid characters (not UTF-8) introduced by Microsoft Word. Some table rows were combined and a few "tbd" entries were inserted.
GlossXML Diagram Another glossary effort: XML Glossary Standard: Export File Format (GlossXML) with the GlossXML.dtd
XMLAD Diagram Another glossary effort: XML Acronym Demystifier (XMLAD) with the xmlad.xsd (XML Schema). See the XML-related terms that were harvested from XML Acronym Demystifier.

Last Updated: March 20, 2005    

Web site content copyright © 2002-2005 Kenneth B. Sall. All Rights Reserved.