The UK Data Archive (the Archive) has been experimenting with RDF for a couple of years. When the funding was secured from JISC to apply SKOS to HASSET (Humanities and Social Science Electronic Thesaurus), it was a welcome opportunity for the development team to create a real-life production instance of an RDF dataset that was of a manageable size and which was relatively static. This post gives a brief overview of some of the technologies that the Archive has deployed to deliver a SKOS-based thesaurus.
2. Applying SKOS to the existing HASSET Thesaurus
HASSET is currently stored as relational data in Microsoft SQL Server. The challenge for us was to ‘translate’ the existing relationships as defined in traditional rows and columns into RDF triples.
Each term in HASSET has an explicit relationship type (coded as an integer) to another term. Happily, these relationship types map closely onto the main SKOS predicates:
| HASSET Relationship Between x and y | Proprietary Code | Related SKOS Predicate for each skos:Concept |
|---|---|---|
| Is a Broader Term For | 5 | skos:broader |
| Is a Narrower Term Of | 6 | skos:narrower |
| Is a Synonym Of | 4 | skos:altLabel |
| Should Be Used For | 2 | skos:prefLabel |
| Is a Related Term Of/For | 8 | skos:related |
| Is a Top Term For | 7 | skos:topConceptOf & skos:hasTopConcept |
Additionally, each SKOS Concept has a skos:inScheme predicate, which simply states that the Concept belongs to the SKOSHASSET Thesaurus.
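To make the mapping concrete, the table above can be expressed as a simple lookup from the proprietary integer codes to full SKOS predicate URIs. This Python dictionary is our own illustration of the idea (the Archive's actual generator is written in C# against dotNetRDF):

```python
# Namespace URI for the SKOS vocabulary.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Map HASSET's proprietary relationship codes to SKOS predicates,
# following the table above. (Illustrative structure, not the
# Archive's production code.)
CODE_TO_PREDICATE = {
    5: SKOS + "broader",
    6: SKOS + "narrower",
    4: SKOS + "altLabel",
    2: SKOS + "prefLabel",
    8: SKOS + "related",
    7: SKOS + "topConceptOf",
}

def predicate_for(code):
    """Return the full SKOS predicate URI for a HASSET relationship code."""
    return CODE_TO_PREDICATE[code]
```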
3. Creating and Serializing RDF data
Once we understood the constituent SKOS parts of our desired RDF triples, we wrote a “SkosHassetGenerator” class to iterate through each table row, examine the relationship type and generate the appropriate SKOS triple. The UK Data Archive is primarily a .NET organisation, so we referenced a number of C# libraries from http://www.dotnetrdf.org/, which is open source and in turn uses JSON.Net for JSON serialization. The dotNetRDF libraries are well documented, and this was helpful in generating several serialisations of SKOSHASSET, namely RDF/XML, RDF/JSON, Turtle (which seems to be increasingly popular), NTriples and CSV.
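As a language-neutral illustration of what such a generator does, each database row of the form (subject term, relationship code, object term or label) becomes one triple, which a line-based format such as NTriples can emit directly. The base URI, row data and function names below are invented for the sketch; the production code is C# using dotNetRDF:

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/skoshasset/"  # placeholder, not the Archive's real base URI

# prefLabel/altLabel relationships take literal objects; the rest
# point at other concepts.
LITERAL_CODES = {2, 4}

def row_to_ntriple(subj, code, obj):
    """Turn one hypothetical relational row into an NTriples line."""
    predicate = {5: "broader", 6: "narrower", 4: "altLabel",
                 2: "prefLabel", 8: "related", 7: "topConceptOf"}[code]
    if code in LITERAL_CODES:
        obj_term = '"%s"@en' % obj            # literal with language tag
    else:
        obj_term = "<%s%s>" % (BASE, obj)     # URI of the related concept
    return "<%s%s> <%s%s> %s ." % (BASE, subj, SKOS, predicate, obj_term)

# Hypothetical rows as they might come back from the relational store:
rows = [
    ("C100", 2, "Employment"),   # prefLabel
    ("C100", 6, "C200"),         # narrower concept
]
for row in rows:
    print(row_to_ntriple(*row))
```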
A fundamental part of RDF is a persistent, dereferenceable URI for each SKOS Concept. In testing we used a local, temporary URI. On release, however, we are planning to move to a human-readable, logical URI on a subdomain of data-archive.ac.uk. This is currently being set up and we expect it to be lod.data-archive.ac.uk/skoshasset/<GUID>. More information on the precise URI will follow.
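A concept URI of that shape can be minted deterministically from each term's GUID. A minimal sketch, assuming the anticipated subdomain above (the final URI pattern may well differ):

```python
import uuid

# Anticipated base URI from the text; subject to change before release.
BASE = "http://lod.data-archive.ac.uk/skoshasset/"

def concept_uri(guid):
    """Build a dereferenceable concept URI from a term's GUID.

    Parsing through uuid.UUID normalises the GUID to its canonical
    lowercase, hyphenated form, so the same term always yields the
    same (persistent) URI.
    """
    return BASE + str(uuid.UUID(str(guid)))
```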
The “SkosHassetGenerator” class is run daily as a scheduled console application from our Jenkins server. As well as generating physical text files on a network share (useful both as a snapshot-archiving mechanism and for direct download by end users), the class writes the triples into a dedicated Triple Store (see below).
We have identified several benefits to our application of SKOS. Not only does it make the terms easier to maintain and manipulate, but it has also meant that the thesaurus can be more thoroughly validated by third-party online tools. One particular favourite of ours is PoolParty.
4. Persisting RDF data
Having generated a SKOS version of the HASSET thesaurus as RDF text files, the next stage was to be able to persist these data in a Triple Store which would allow querying of the data via SPARQL. As we are primarily a .NET house, we selected BrightStarDB. This is open source and is itself built on the same dotNetRDF classes we used to generate the triples in the first place. BrightStarDB also allows us to easily configure a SPARQL endpoint on IIS7 and provides support for Microsoft Entity Framework, which is the Object Relational Mapper we normally use to connect our web services infrastructure to back-end databases.
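Once the triples are in the store, questions like "what are the narrower terms of this concept?" become a few lines of SPARQL against the endpoint. The helper below merely builds such a query string; the concept URI is a placeholder, and the exact query is our own illustration rather than anything mandated by BrightStarDB:

```python
# Template for a simple SPARQL query: narrower terms of a concept,
# with their preferred labels.
SPARQL_TEMPLATE = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?narrower ?label WHERE {
  <%s> skos:narrower ?narrower .
  ?narrower skos:prefLabel ?label .
}
"""

def narrower_terms_query(concept_uri):
    """Return a SPARQL query for the narrower terms of the given concept."""
    return SPARQL_TEMPLATE % concept_uri
```

The resulting string would be POSTed to the SPARQL endpoint over HTTP in the usual way.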
5. From RDF data to Linked Data
Following on from populating the Triple Store and establishing a SPARQL endpoint, the next stage was to make the SKOS Concept URIs publicly dereferenceable and useful to the wider user community. This initially presented us with a headache. We have approximately 7000 unique terms (or SKOS Concepts) in HASSET. How do you maintain 7000 persistent identifiers on a web server and deliver both HTML and RDF content to users and machines? Fortunately, following some research, we identified another open source product, called Pubby, based on Java components, principally Tomcat, Jena and Velocity. With minor configuration and stylesheet changes, this enabled us to set up a web server to point to our SPARQL endpoint and deliver HTML or RDF content as requested by the end user or machine.
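The core of what Pubby provides here is HTTP content negotiation: the same concept URI serves HTML to a browser and RDF to a machine, typically via a 303 redirect to the appropriate representation. The sketch below is a deliberately crude illustration of that branching, not Pubby's implementation; real negotiation parses quality values and many more media types:

```python
def negotiate(accept_header):
    """Pick an RDF or HTML representation from an Accept header.

    Returns "data" (redirect to the RDF document) or "page" (redirect
    to the HTML page). Simplified: ignores q-values and wildcards.
    """
    rdf_types = ("application/rdf+xml", "text/turtle", "application/n-triples")
    for media_type in rdf_types:
        if media_type in accept_header:
            return "data"
    return "page"
```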
Most of the work has now been completed in terms of applying SKOS to our HASSET Thesaurus. All that remains is to make the SKOSHASSET SPARQL endpoint and Pubby publicly available for testing. This will be completed by the end of the project.
Fig 1. Schematic for generating SKOS Linked Data from HASSET Thesaurus