From Tuples to Triples: applying SKOS to HASSET – a technical overview

Darren Bell

1. Introduction

The UK Data Archive (the Archive) has been experimenting with RDF for a couple of years. When funding was secured from JISC to apply SKOS to HASSET (the Humanities and Social Science Electronic Thesaurus), it was a welcome opportunity for the development team to create a real-life production instance of an RDF dataset that was of a manageable size and relatively static.  This post gives a brief overview of some of the technologies that the Archive has deployed to deliver a SKOS-based thesaurus.

2. Applying SKOS to the existing HASSET Thesaurus

HASSET is currently stored as relational data in Microsoft SQL Server.  The challenge for us was to ‘translate’ the existing relationships as defined in traditional rows and columns into RDF triples.

Each term in HASSET has an explicit relationship type (coded as an integer) to another term.  Happily, these relationship types map closely onto the main SKOS predicates:

HASSET Relationship Between x and y | Proprietary Code | Related SKOS Predicate(s) for each skos:Concept
Is a Broader Term For               | 5                | skos:broader
Is a Narrower Term Of               | 6                | skos:narrower
Is a Synonym Of                     | 4                | skos:altLabel
Should Be Used For                  | 2                | skos:prefLabel
Is a Related Term Of/For            | 8                | skos:related
Is a Top Term For                   | 7                | skos:topConceptOf & skos:hasTopConcept

Additionally, each SKOS Concept has a skos:inScheme predicate, which simply states that the Concept belongs to the SKOSHASSET Thesaurus.
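The code-to-predicate mapping above amounts to a simple lookup. A minimal sketch in Python (the Archive's actual implementation is a C# class using dotNetRDF; the names here are illustrative only):

```python
# The standard SKOS namespace.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# HASSET's proprietary relationship codes, mapped to SKOS predicates
# as in the table above.
CODE_TO_PREDICATE = {
    5: "broader",       # Is a Broader Term For
    6: "narrower",      # Is a Narrower Term Of
    4: "altLabel",      # Is a Synonym Of
    2: "prefLabel",     # Should Be Used For
    8: "related",       # Is a Related Term Of/For
    7: "topConceptOf",  # Is a Top Term For (paired with skos:hasTopConcept)
}

def skos_predicate(code: int) -> str:
    """Return the full SKOS predicate URI for a HASSET relationship code."""
    return SKOS + CODE_TO_PREDICATE[code]
```

For example, `skos_predicate(5)` yields `http://www.w3.org/2004/02/skos/core#broader`.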

3. Creating and Serializing RDF data

Once we understood the constituent SKOS parts of our desired RDF triples, we wrote a “SkosHassetGenerator” class to iterate through each table row, examine the relationship type and generate the appropriate SKOS triple.  The UK Data Archive is primarily a .NET organisation, so we referenced a number of C# libraries from http://www.dotnetrdf.org/, which is open source and in turn uses JSON.Net for JSON serialization.  The dotNetRDF libraries are well documented, and this was helpful in generating several serialisations of SKOSHASSET, namely RDF/XML, RDF/JSON, Turtle (which seems to be increasingly popular), NTriples and CSV.
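In outline, the generator turns each relational row into one triple. A hedged sketch of that step in Python, emitting N-Triples lines (the real class is C# on dotNetRDF; the row shape and function names here are invented for illustration):

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

CODE_TO_PREDICATE = {5: "broader", 6: "narrower", 4: "altLabel",
                     2: "prefLabel", 8: "related", 7: "topConceptOf"}

def row_to_ntriple(subject_uri: str, code: int, obj: str) -> str:
    """Turn one relational row into an N-Triples line.

    Label codes (2 = prefLabel, 4 = altLabel) take a literal object;
    the hierarchical and associative codes take another concept's URI.
    """
    predicate = "<%s%s>" % (SKOS, CODE_TO_PREDICATE[code])
    if code in (2, 4):
        obj_term = '"%s"@en' % obj        # label literal with a language tag
    else:
        obj_term = "<%s>" % obj           # URI of the related concept
    return "<%s> %s %s ." % (subject_uri, predicate, obj_term)
```

So a "narrower" row produces, for example, `<.../c1> <...#narrower> <.../c2> .`, while a "should be used for" row produces a `skos:prefLabel` triple with a literal object.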

A fundamental part of RDF is a persistent, dereferenceable URI for each SKOS Concept.  In testing we used a local, temporary URI.  On release, however, we plan to move to a human-readable, logical URI under a subdomain of data-archive.ac.uk.  This is currently being set up and we expect it to be lod.data-archive.ac.uk/skoshasset/<GUID>.  More information on the precise URI will follow.

The “SkosHassetGenerator” class runs daily as a scheduled console application from our Jenkins server.  As well as generating text files on a network share (useful both as a snapshot-archiving mechanism and for direct download by end users), the class writes the triples into a dedicated Triple Store (see below).

We have identified several benefits to our application of SKOS.  Not only does it make the terms easier to maintain and manipulate, but it also means that the thesaurus can be more thoroughly validated by third-party online tools.  One particular favourite of ours is PoolParty.

4. Persisting RDF data

Having generated a SKOS version of the HASSET thesaurus as RDF text files, the next stage was to persist these data in a Triple Store which would allow querying via SPARQL.  As we are primarily a .NET house, we selected BrightStarDB.  This is open source and is itself built on the same dotNetRDF classes we used to generate the triples in the first place.  BrightStarDB also allows us to easily configure a SPARQL endpoint on IIS7 and provides support for Microsoft Entity Framework, which is the Object Relational Mapper we normally use to connect our web services infrastructure to back-end databases.
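Once the triples are in the store, typical queries walk the SKOS relationships. A sketch of one such query and the request URL a client would send to the endpoint, following the standard SPARQL Protocol `query` parameter (the endpoint address here is a placeholder, not the Archive's real one):

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "http://localhost/sparql"  # placeholder; the real endpoint sits on IIS7

def narrower_terms_query(concept_uri: str) -> str:
    """SPARQL that fetches a concept's narrower terms and their preferred labels."""
    return """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?narrower ?label WHERE {
  <%s> skos:narrower ?narrower .
  ?narrower skos:prefLabel ?label .
}""" % concept_uri

def request_url(concept_uri: str) -> str:
    """GET-style request URL for the endpoint (SPARQL Protocol 'query' parameter)."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": narrower_terms_query(concept_uri)})
```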

5. From RDF data to Linked Data

Following on from populating the Triple Store and establishing a SPARQL endpoint, the next stage was to make the SKOS Concept URIs publicly dereferenceable and useful to the wider user community.  This initially presented us with a headache.  We have approximately 7,000 unique terms (or SKOS Concepts) in HASSET.  How do you maintain 7,000 persistent identifiers on a web server and deliver both HTML and RDF content to users and machines?  Fortunately, following some research, we identified another open source product, called Pubby, based on Java components, principally Tomcat, Jena and Velocity.  With minor configuration and stylesheet changes, this enabled us to set up a web server that points to our SPARQL endpoint and delivers HTML or RDF content as requested by the end user or machine.
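The "HTML or RDF as requested" behaviour is HTTP content negotiation on the Accept header. A simplified sketch of the decision in Python, in the spirit of how Pubby serves a concept URI (real negotiation also parses q-values; the redirect paths are illustrative):

```python
def negotiate(accept_header: str):
    """Choose a representation for a concept URI: RDF for machines, HTML for
    browsers. Returns an HTTP status plus the path prefix to redirect to."""
    rdf_types = ("application/rdf+xml", "text/turtle", "application/n-triples")
    for media_type in rdf_types:
        if media_type in accept_header:
            return ("303 See Other", "/data/")  # machine client: RDF representation
    return ("303 See Other", "/page/")          # default: human-readable HTML page
```

A crawler sending `Accept: application/rdf+xml` is redirected to the RDF document, while a browser sending `Accept: text/html` lands on the HTML page for the same concept.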

Most of the work has now been completed in terms of applying SKOS to our HASSET Thesaurus.  All that remains is to make the SKOSHASSET SPARQL endpoint and Pubby publicly available for testing.  This will be completed by the end of the project.

Fig 1.  Schematic for generating SKOS Linked Data from HASSET Thesaurus



Licensing for SKOS-HASSET: WP4 deliverable – the SKOS-HASSET Licence Recommendation Report

1.0 Background

The Archive maintains two thesauri: the first, the Humanities and Social Science Electronic Thesaurus (HASSET), is owned entirely by the University of Essex and contains subject terms covering all the social science disciplines; the second, the European Language Social Science Thesaurus (ELSST), takes the core, internationally applicable terms from HASSET and translates them into a number of European languages. The University of Essex owns the Intellectual Property (IP) in some of this product, but not all.

HASSET, in non-SKOS form (usually as a .csv or a PDF) has been made available for not-for-profit use for many years; however, access to the full set of terms and their relationships has always been granted only after a licence has been signed by the recipient and returned to the UK Data Archive, University of Essex. Indeed, a new licence template was developed in 2011-2012, based initially on the JISC Model Proforma, to ensure that the rights of the University were being adequately protected. It is important to note that in the past this licence has always referred to the intellectual, creative content of the thesauri – the hierarchies and their relationships – and not the database or syntactical structure. A licence has always been applied to the thesauri in the past in order to protect a) the integrity of the terms and b) the quality of the translations. The current licence has a dual purpose: it covers both the use of the thesaurus as an indexing tool as well as regulating its translation into further languages.

Additionally and importantly, the UK Data Archive would like to expand further the membership of the HASSET/ELSST user community (already enhanced during the SKOS-HASSET project). The release of the SKOS product provides an ideal opportunity to further the work of the Archive in communicating with, and learning from, its thesaurus users. To do this, however, the Archive would need to know who its thesaurus users were.

Because of the expected need to maintain the quality and integrity of the thesaurus, this Licensing Report reviews and makes recommendations for the licence conditions under which the SKOS-enabled thesaurus product (SKOS-HASSET) may be delivered.

1.1 Previous IP work

Investigative work covering the Intellectual Property Rights (IPR) and associated licences in use in relation to HASSET and ELSST took place in 2011. In summary, this work found that:

1. the IPR in HASSET are owned by the University of Essex, but the IPR in parts of ELSST are owned by third parties;
2. the licensing system in relation to ELSST is complicated by the need to make provision for quality translations;
3. some users of ELSST, and indeed some of the ELSST IPR owners, are outside UK academia.

The University of Essex owns all of the Intellectual Property (IP) in HASSET and much of the Intellectual Property in the current version of ELSST. Archive staff undertook some of the original ELSST translation work for the French, German and Spanish translations and created the structure and framework for the thesaurus, which was based on HASSET. ELSST was developed here in the Archive during the LIMBER project (January 2000 – June 2001), but has been developed further since then. As such, the Archive holds the IP in:

• the database structure;
• the thesaurus structure/hierarchies;
• the core terms;
• the English extensions, which include terms specific to British life, government and administration (effectively HASSET);
• the Spanish translations;
• the German translations made up to and including 9 December 2005;
• the French translations made up to and including 30 June 2001.

The remaining IP is held by other, external organisations or individuals. These Intellectual Property Rights relate to:

• the Finnish translations and extensions;
• the Greek translations and extensions;
• the Norwegian translations and extensions;
• the Danish translations and extensions;
• the Swedish translations and extensions;
• the Lithuanian translations and extensions;
• the German translations and extensions to be released in the next version;
• the French translations and extensions made from 1 November 2005.

A brief review of existing licence models was undertaken in May 2011. This review identified other thesaurus products, external to the Archive, with:

• no licences;
• Open Government Licences;
• more restrictive, detailed licences.

These products were not all SKOS-enabled. All these licence models, as well as the licence template included in the JISC IPR toolkit, were examined.

The JISC model was taken as the starting point in developing a new licence; this licence model ‘contains more favourable provisions than any standard commercial licence for access and use of online resources’ (Korn, 2011). Nonetheless, HASSET and ELSST’s needs were in some respects simpler than those catered for under the JISC proforma, and in others more complex. The JISC proforma includes provision for institutional responsibilities in relation to resources being made available to staff and students, which a HASSET/ELSST product did not require; on the flipside, any licence covering ELSST and HASSET must protect the integrity and quality of the product not only when it is being used, but also when it is being translated.

As part of this work an initial licence was set up in 2011, based on the JISC exemplum. This has since gone through many iterations. The final licence was approved by the University of Essex’s Research Enterprise Office on 13th March 2012. This is the licence that is currently in use for HASSET and ELSST.

2.0 Copyright issues and thesauri

Copyright law and technology have long been at odds with each other. The Hargreaves report from 2011 includes an entire section on this issue, describing in detail how copyright law has held up technological – and, in some cases, societal – advances. It says:

‘So the question is how to build in sufficient flexibility to realise the benefits of new technologies, without losing the core benefits to creators and to the economy that copyright provides.’ (Hargreaves, 2011)

One of the ways that technologists have tried to address this is through the creation and use of open source systems and the application of Creative Commons licences. This approach works reasonably well; however, there are two key problems in its application to a thesaurus:

1. Creative Commons licences only apply to creative works; they do not cover data or databases (Miller, Styles and Heath, 2008). The question of whether the intellectual property rights in a hierarchy structure would be covered by a Creative Commons licence is an open one.

2. A thesaurus is an authorised set of terms which describe an aspect of the world, society or a discipline. Through their authorisation, they are both descriptive and prescriptive – and this is an important pairing. Thesauri are living, dynamic tools, being updated to reflect changes in the world; however, for these changes to carry authority, they must be made by a single organisation – the thesaurus owner. Each thesaurus should exist in a single, controlled and authorised form. If this does not happen, the integrity of the terms is under question.

Importantly, RDF and SKOS are simply one type of data format that may be applied to a file. Méndez and Greenberg (2012) describe ‘linked open vocabularies as a part of the new knowledge organization ecosystem’. They go on to explain that research has shown that the ‘subject’ or conceptual-type search is the most common type of search on the web. This raises the need both for inter-linked thesauri and controlled vocabularies which try to describe the world, and for high quality thesauri and controlled vocabularies. Using Linked Data formats is entirely appropriate here; however, in order to maintain the quality of the product, these formats do not necessarily have to be Linked Open Data. Sánchez, Méndez and Rodríguez-Muñoz (2009) explain how the use of SKOS enhances both the user’s and the provider’s experiences: ‘From the user perspective, the use of thesauri developed with the SKOS model affects … those who use them through query operations. For information managers, SKOS offers a closer approach to knowledge organization and management, complementing the automatic extraction of textual content from documents with its indexing through conceptual entities’. There is little doubt that SKOS is the ideal format for index terms both to index conceptually and automatically. The question is how to do this while still maintaining quality control.

A key, recent work on the applicability of various conditions sets to academic tools is the review of the openness of licences undertaken by the Naomi Korn Copyright Agency on behalf of the JISC (Korn, 2011). This review analyses the various licences available to data or resource producers, including the JISC Model Licence, Creative Commons licences, the JISC Collections Open Educational User Licence v 1.0, Open Data Commons and the Open Government Licence.

Of these, the most appropriate licences which could be considered for the SKOS-HASSET project would be the JISC ones, Creative Commons and, possibly, the Open Data Commons licence. The Open Government Licence refers primarily to Government information and so, although a contender, may not be ideal. The JISC Model Licence is the de facto licence currently in use for HASSET and ELSST, as it was taken by the University of Essex as the proforma for the existing HASSET and ELSST licence.

Korn’s work outlines two of the key issues which would prevent the use of Creative Commons Licences in relation to SKOS-HASSET; the review states that:

1. Creative Commons Licences may not be suitable ‘where third-party issues are present and require additional clearance.’
(This is definitely the case for ELSST where the majority of the translations are owned by other, non-UK organisations and individuals.)

2. ‘At a strategic level, committing to the irrevocable terms of CC licences raises issues of broader access and commercial goals for organisations.’
(Once set up, it would be difficult to reverse the terms of Creative Commons licences. While the CESSDA ERIC, the eventual legal entity in relation to European data archives, is still being established, flexibility in terms of being able potentially to change licence conditions is essential.)

The Open Data Commons Licence, set up as an open solution for data or databases (rather than creative works), may be a contender in relation to the HASSET/ELSST database structure, as might the GNU General Public License (GPL). The GPL is a free, copyleft licence for software and other kinds of works. It is intended to guarantee the freedom of a work’s creators and users to share and change all versions of it, and to ensure that all versions remain free for all. All works derived from a database licensed under the GPL must abide by the terms of the original licence.

It is the very inclusion of a database alongside the creative work it holds which makes the use of either of these licences problematic, though. The main problem would be the need to include a second licence covering the intellectual, creative content – the hierarchies – as neither of these licences covers this sort of information. Releasing the thesaurus products under one of these data/database-type licences may lead users to assume that any changes may be made to the terms and re-released under the same terms and conditions. Although the terms of the licences would not permit this, confusion could still ensue. It would also create a multi-licence situation, complicating matters rather than simplifying them.

Related to this, Naomi Korn identifies some circumstances in which it may not be appropriate to use open licences. Two of these match the circumstances of HASSET/ELSST; she states that ‘situations where [considering the placing of ‘some’ restrictions upon the user, such as ’No derivative works’ (‘ND’) and/or ‘Non-commercial’ (‘NC’) restrictions] will need to be made include the following’:

1. inclusion of data and/or databases
(one of the key issues here is that the University of Essex owns the IP in the database and syntactical structure);

2. inclusion of third-party-generated content for which permissions have not been cleared
(again, this is very pertinent in relation to ELSST).

Korn suggests that in these circumstances, one could use a licence with a ‘no derivatives’ attribution, a licence with a ‘no commercial use’ attribution or a licence that restricts certain classes of users from being able to access resources. These options, while entirely suitable for the academic community, still prevent access to, or use of, a variety of resources by third parties outside UK education.

The SKOS-HASSET project team does not wish to restrict access to, or even the re-purposing of, its tools and products on the basis of user types or geography, especially as many related thesauri are owned and developed in non-UK and sometimes non-educational arenas. Online browsing of individual concepts and their relationships should be maintained. In terms of the full set of hierarchies, however, the project team requires the ability to maintain the integrity of its products in their entirety and to be offered any derivations. The product may end up being of interest to a community wider than academia and, in fact, colleagues in the commercial, publishing sector have already expressed an interest in it. Rather than restricting users, surely it would be better to make the thesaurus available to anyone and everyone, but behind a simple bespoke and effective licence, even if this is slightly more restrictive than Creative Commons?

Korn supports this. She states that ‘whilst undoubtedly there are numerous benefits associated with the use of ‘open’ licences and the creation of truly Open Educational Resources, which are repurposable and reusable, there are clearly circumstances … where this is not feasible’. Circumstances in which a) the widest community of users, including potentially those from commerce or overseas would be prevented from gaining access to the product, b) the quality and integrity of the product may be jeopardised and c) IPR are shared among a variety of organisations and individuals are likely to be those within which an open licence may not be the best way forward.

2.1 Product integrity

Naomi Korn suggests that where there is a question mark over the use of an open licence, ‘the priorities of the initial licensor of the content need to be based upon an open vs risk evaluation, rather than openness only’ (Korn, 2011). The SKOS-HASSET project has undertaken just such an open vs risk evaluation:

Item | Predicted risks with open licence | Predicted advantages of open licence | Benefit score (1-5, 1=least benefit) | Likelihood of risk (1-5, 1=least likely) | Severity of risk (1-5, 1=least severe) | Risk score | Adjusted risk score (Risk score – Benefit score)
Hierarchies may be changed without licensor’s knowledge | Terms would lose their integrity; relationships may be broken; SKOS may no longer validate; archives/services using HASSET/ELSST for indexing may not be using the authorised version; multiple versions may proliferate; lack of consistency and harmonisation in versions used would undermine existing, and restrict future, efforts to build tools for multisite use (e.g. discovery of similar/comparable resources in other countries) | Product may be used widely; product may be made more appropriate to local needs | 5 | 5 | 5 | 25 | 20
Hierarchies may be added without licensor’s knowledge | Relationships may be broken; SKOS may no longer validate; archives/services using HASSET/ELSST for indexing may not be using the authorised version; multiple versions may proliferate; lack of consistency and harmonisation in versions used would undermine existing, and restrict future, efforts to build tools for multisite use (e.g. discovery of similar/comparable resources in other countries) | Product may be used widely; product may be made more appropriate to local needs | 5 | 5 | 5 | 25 | 20
Derived version of HASSET or ELSST may be released by third party organisation | Terms would lose their integrity; archives/services using HASSET/ELSST for indexing may not be using the authorised version; multiple versions would proliferate; lack of consistency and harmonisation in versions used would undermine existing, and restrict future, efforts to build tools for multisite use (e.g. discovery of similar/comparable resources in other countries) | Product may reach previous non-user communities | 2 | 3 | 5 | 15 | 13
New translations may be made of the core terms | Quality control of translations would not be made centrally; like-for-like translations may be attempted; authorised version would not receive the new translations | Product may be translated into more languages than at present | 4 | 4 | 5 | 20 | 16
Derived version of HASSET or ELSST may be sold by third party organisation | Terms would lose their integrity; archives/services using HASSET/ELSST for indexing may not be using the authorised version; multiple versions would proliferate; lack of consistency and harmonisation in versions used would undermine existing, and restrict future, efforts to build tools for multisite use (e.g. discovery of similar/comparable resources in other countries); academic community would suffer; legal proceedings may follow | Product may reach previous non-user communities | 2 | 1 | 5 | 5 | 3

Table 1: Open vs risk evaluation

This led on to a further analysis, weighting the risk with any benefits:

Item | Benefit score (1-5, 1=least benefit) | Risk score | Adjusted risk score (Risk score – Benefit score)
Hierarchies may be changed without licensor’s knowledge | 5 | 25 | 20
Hierarchies may be added without licensor’s knowledge | 5 | 25 | 20
Derived version of HASSET or ELSST may be released by third party organisation | 2 | 15 | 13
New translations may be made of the core terms | 4 | 20 | 16
Derived version of HASSET or ELSST may be sold by third party organisation | 2 | 5 | 3

Table 2: Adjusted risk scores

The risks involved with making the thesauri open do not appear to be outweighed by the benefits in this case. The need to retain control of the quality and integrity of the terms, their relationships and the product as a whole is more important than the need to make the product entirely freely available. That is not to say that the product should not be available to all, simply that an element of control, exercised through a licence mechanism, would ensure, in this case, that the product remains of sufficient quality to be useful to all in the future. Any proliferation of derivatives of the thesaurus could, in fact, result in a diluted and less trustworthy product.

3.0 Precedents

A review of the methods of access of existing SKOS products has also been undertaken. The following thesauri have been examined:

• Agrovoc (http://aims.fao.org/standards/agrovoc/about)
• Decimalised Database of Concepts (http://ontologi.es/decimalised/decimalised.rdf)
• Eurovoc (http://eurovoc.europa.eu/drupal/)
• GEMET (http://www.eionet.europa.eu/gemet)
• GeoNames (http://www.geonames.org/)
• IVOAT Thesaurus (http://www.ivoa.net/rdf/Vocabularies/vocabularies-20091007/IVOAT/IVOAT.html)
• Library of Congress Subject Headings (http://id.loc.gov/download/)
• NAL Thesaurus (http://agclass.nal.usda.gov/)
• ONKI portal (http://onki.fi/)
• PICO Thesaurus (http://www.culturaitalia.it/opencms/export/sites/culturaitalia/attachments/thesaurus/4.3/thesaurus_4.3.0.skos.xml)
• RAMEAU (http://rameau.bnf.fr/informations/rameauenbref.htm)
• STW Thesaurus for Economics (http://zbw.eu/stw/versions/latest/about)
• TheSoz (http://www.gesis.org/en/services/research/thesauri-und-klassifikationen/social-science-thesaurus/)

A further list of SKOS-enabled thesauri can be found at http://www.w3.org/2001/sw/wiki/SKOS/Datasets.

Agrovoc
Coverage: The AGROVOC thesaurus contains more than 40,000 concepts in up to 22 languages covering topics related to food, nutrition, agriculture, fisheries, forestry, environment and other related domains.
File formats for download: Supported: RDF/XML, N-triples, Web Services. Unsupported: SKOS RDF/XML (English only), MySQL, Protégé DB, OWL.
Comments: Re-authentication takes place using email addresses only (once registered, users need only enter their email address again to gain access a second time).
Licence: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Decimalised Database of Concepts
Coverage: The Decimalised Database of Concepts is a collection of topics suitable for use in linked data. It is inspired by the Dewey Decimal Classification, but no guarantees are made about the closeness of its resemblance as a whole. SKOS mapping links are provided from this database to the Dewey system, to Library of Congress Classification codes and to DBPedia resources where possible.
File formats for download: XHTML+RDFa 1.0, RDF/XML, N-triples.
Licence: Freely available.

Eurovoc
Coverage: EuroVoc is a multilingual, multidisciplinary thesaurus covering the activities of the EU and the European Parliament in particular. It is managed by the EU Publications Office. It contains terms in 22 EU languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish), plus Croatian and Serbian.
File formats for download: SKOS/RDF, XML.
Comments: The licence acceptance is taken offline. Users must email a dedicated mailbox, asking for access to EuroVoc. Once granted, users are sent a PDF copy of the licence, plus a username and login.
Licence: EU-specific licence, which must be accepted offline.

GEMET
Coverage: GEneral European Multilingual Environment Thesaurus. GEMET has been developed as an indexing, retrieval and control tool for the European Topic Centre on Catalogue of Data Sources (ETC/CDS) and the European Environment Agency (EEA), Copenhagen.
File formats for download: HTML for import into MS Access; RDF (themes and groups relationships); SKOS/RDF (broader/narrower relations).
Licence: Freely available.

GeoNames
Coverage: The GeoNames geographical database covers all countries and contains over eight million placenames that are available for download free of charge.
File formats for download: Free of charge: individual Gazetteer files (one per place) and some free post code files. Costed services also exist.
Licence: Creative Commons Attribution 3.0 License.

IVOAT Thesaurus
Coverage: Astronomical terms.
File formats for download: RDF, Turtle.
Licence: Freely available.

Library of Congress Subject Headings
Coverage: Library of Congress Subject Headings (LCSH) has been actively maintained since 1898 to catalogue materials held at the Library of Congress. LCSH in this service includes all Library of Congress Subject Headings, free-floating subdivisions (topical and form), Genre/Form headings, Children’s (AC) headings, and validation strings for which authority records have been created. The content includes a few name headings (personal and corporate), such as William Shakespeare, Jesus Christ, and Harvard University, and geographic headings that are added to LCSH as they are needed to establish subdivisions, provide a pattern for subdivision practice, or provide reference structure for other terms.
File formats for download: RDF/XML, Turtle, N-triples.
Licence: Freely available.

NAL Thesaurus
Coverage: The USA’s National Agricultural Library (NAL) thesaurus and glossary are online vocabulary tools of agricultural terms in English and Spanish and are cooperatively produced by the NAL, US Department of Agriculture, and the Inter-American Institute for Cooperation on Agriculture through the Orton Memorial Library, the Mexican Network of Agricultural Libraries (REMBA), as well as other Latin American agricultural institutions belonging to the Agriculture Information and Documentation Service of the Americas (SIDALC).
File formats for download: XML, RDF-SKOS, PDF, MARC, DOC.
Comments: User must click to accept the usage conditions.
Licence: NAL-specific terms and conditions of use.

ONKI portal
Coverage: The ONKI service contains Finnish and international ontologies, vocabularies and thesauri needed for publishing content on the Semantic Web.
File formats for download: OWL, RDF/XML, Turtle, SKOS-RDF, RDFS.
Comments: ONKI provides access to many different vocabularies and ontologies.
Licence: Files either freely available or behind a Creative Commons 3.0 License.

PICO Thesaurus
Coverage: The Dictionary of Italian Culture is a controlled vocabulary designed for subject indexing and classification of heterogeneous resources, sourced from different cultural contexts.
File formats for download: XML.
Comments: The XML has been put through an RDF-to-HTML stylesheet, so it is both human-readable on screen and may be saved as RDF. Nothing needs to be agreed before viewing or saving become possible.
Licence: Creative Commons 2.5 License (Italian).

RAMEAU
Coverage: RAMEAU consists of a vocabulary of interconnected terms and an indicative syntax for the construction of subject headings. It includes a set of authority records (common names and geographical entities). RAMEAU is enriched progressively through proposals from its network of users.
File formats for download: None.
Comments: The web site states: ‘Fournitures de fichiers de données brutes: Des fichiers de données brutes peuvent être fournis, portant : soit sur des produits courants : les notices RAMEAU (telles que définies dans Périmètre des autorités RAMEAU), créées, modifiées (à l’exception des modifications induites par des liens dans d’autres notices) et annulées pendant la période considérée ; soit sur des produits rétrospectifs : l’ensemble des notices RAMEAU (telles que définies dans Périmètre des autorités RAMEAU) à une date donnée.’ (=‘Availability of raw data files: raw data files can be supplied that relate either to: current products – RAMEAU records (as defined in ‘Scope of RAMEAU authority records’), created, modified (except for changes caused by links in other records) and cancelled during the period in question; or retrospective products – all RAMEAU records (as defined in ‘Scope of RAMEAU authority records’) at a given date.’)[1] The user is directed to the Produits et services bibliographiques page of the Bibliothèque nationale de France (BNF), which provides a further link to Produits bibliographiques. This last page includes a price list. Autorités RAMEAU, for example, costs between €1,000 and €1,196 per installation.
Licence: BNF-specific licence, which must be accepted offline.

STW Thesaurus for Economics
Coverage: This thesaurus relates primarily to economics; it contains more than 6,000 standardized subject headings and about 19,000 entry terms to support individual keywords. It also contains technical terms used in law, sociology, or politics, as well as geographic entities.
File formats for download: RDF/XML, Turtle, N-triples.
Comments: Optional email list available to sign up to for announcements relating to the thesaurus, but no need to sign up prior to download.
Licence: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Germany License.

TheSoz
Coverage: The Thesaurus for the Social Sciences (Thesaurus Sozialwissenschaften) is used for keyword searching in SOFIS (Social Science Research Information System) and SOLIS (Social Science Literature Information System). The list of keywords contains about 12,000 entries, of which more than 8,000 are descriptors (authorised keywords) and about 4,000 non-descriptors. Topics in all of the social science disciplines are included. It has been translated into French and English.
File formats for download: SKOS-XL, RDF/XML, Turtle.
Comments: Users must complete an online form prior to download. This captures name, organisation and email address. Nothing needs to be agreed in advance. As well as the downloadable data, TheSoz is available in a Linked Data HTML representation. A SPARQL endpoint (using Pubby) is also available as a technical interface.
Licence: Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 Germany License.

[1] Translation provided by Dr Lorna Balkan, 17 October 2012.

Table 3: Licence arrangements for other thesauri

Of the twelve thesauri included above, five (42%) have Creative Commons licences, four (33%) are freely available and three (25%) maintain their own, bespoke licensing arrangements. (Onki provides access to thesauri and other tools that are either freely available or protected by Creative Commons.) It is clear that the licensing landscape contains many, varied features, with different licence models in operation.

Recommendations

Following the brief review of existing work in this field, of risks and precedents and of the particular needs of a HASSET/ELSST product, this report makes the following recommendations:

• a bespoke licence will be used, in preference to Creative Commons licences, which are a) too broad, b) irreversible and c) difficult to apply in a multi-IP situation;

• the existing licence in use for HASSET will continue to be used; this has the following advantages:

• it is a single licence which can be adapted for both HASSET and ELSST and any combinations of their shared terms;

• the licence does not currently prevent profit-making, non-UK or non-academic usage, but, through the offline completion of its custom fields, has the flexibility to allow all users to gain access to the product;

• it does not permit the thesaurus to be passed on to third parties;

• it already includes the requirement that all translations must be made against the British English core terms from the Humanities And Social Science Electronic Thesaurus (HASSET), and plans are already in place, via additional resources, for the management of such translations;

• each concept, encoded in RDF, will be browseable via Pubby, which will also contain copyright information in its Dublin Core metadata and log accesses via Google Analytics (capturing domain names/IP addresses);

• each term and hierarchy will be browseable via the human-readable web pages, which will also contain copyright information in their page metadata and displayed on screen, and which will log accesses via Google Analytics (capturing domain names/IP addresses);

• non-profit making users will be able to download the full thesaurus online by joining the HASSET/ELSST community (involving simple authentication and agreeing to our terms and conditions);

• profit-making users and those wishing to make new translations will be able to gain access to the product through the offline completion of the licence;

• the system to be developed for non-profit making users is envisaged as an amalgamation of the NAL and Agrovoc systems, via the collection of the potential user’s details, including:

• name;
• email address;
• position;
• organisation;
• purpose to be made of the thesaurus;
• date;

and the pre-population of the custom fields of the licence with the information supplied by the user; the user will also be required to click to show that they have agreed to the licence’s terms and conditions;

• the licence will run from the date of acceptance for twelve months, or part thereof, to the date of the next release; licence renewals will be generated via an alerting service (to be established post-project) and will be tied in with the date of annual release of new versions of the thesaurus;

• licence-holders will become members of the HASSET/ELSST community and will be invited to share their uses of the thesauri and to comment on their development in a variety of ways (for instance, via workshops or an annual conference if these are viable, and via the blog);

• simple and appropriate authentication should be employed to verify the user’s email address (e.g. via an automated message, containing a link which must be activated and/or using Shibboleth or similar); the precise form of authentication is yet to be decided.

These recommendations, once implemented, should allow for a simple, yet regulated system, which is a compromise between an entirely open system and one which only permits use of the thesaurus after a signature has been received. It employs a degree of trust in relation to users’ acceptance of the not-for-profit terms and conditions, but also allows the University to maintain the quality and integrity of the product and to track its thesaurus users and their usage of its product. Finally, and importantly, it provides the licence-holders both with the ability to browse the thesauri freely online and to become part of a wider community of thesaurus users.

References

Blumauer, A. (2010) ‘Why SKOS thesauri matter – the next generation of semantic technologies’. Semantic Web Company blog post, 31 August 2010 [http://blog.semantic-web.at/2010/08/31/why-skos-thesauri-matter-the-next-generation-of-semantic-technologies/]

Hargreaves, I. (2011) Digital opportunity: a review of intellectual property and growth. [Newport, South Wales, UK Intellectual Property Office]

Korn, N. (2011) Overview of the ‘Openness’ of licences to provide access to materials, data, databases and media. JISC/Naomi Korn Copyright Consultancy, January 2011.

Méndez, E. and Greenberg, J. (2012) ‘Linked Data for Open Vocabularies and Hive’s Global Framework’, El Profesional de la Información, May/Jun 2012, 21 (3), pp. 236-244.

Miller, I., Styles, R. and Heath, T. (2008) ‘Open data commons: a licence for open data’, LDOW2008, 22 April 2008, Beijing, China. [http://events.linkeddata.org/ldow2008/papers/08-miller-styles-open-data-commons.pdf]

Pastor-Sanchez, J., Mendez, F. J. M. and Rodríguez-Muñoz, J. V. (2009) ‘Advantages of thesaurus representation using the Simple Knowledge Organization System (SKOS) compared with proposed alternatives’, Information Research, 14 (4), paper 422.

Posted in Access, Licensing, Project Management

Language and Computation Day 2012, University of Essex

Lucy Bell and Mahmoud El-Haj, members of the SKOS-HASSET project team, recently presented the work of the project at the University of Essex’s 2012 Language and Computation Day (held on 4 October 2012).  The University’s Language and Computation Group is an inter-disciplinary research group, containing members drawn from a number of departments, including Language and Linguistics, Computer Science and Electronic Engineering as well as the UK Data Archive.  It organises inter-disciplinary meetings, as well as inviting external speakers to the University.

Lucy presented an overview of the work of the project: SKOS-HASSET: a project at the UK Data Archive. The presentation described the headline objectives of applying SKOS to HASSET, testing this via automatic indexing, investigating licensing and improving the user interfaces.  Lucy also gave a summary of the progress that the project has made.  We are now halfway through the contract and have achieved the following so far:

  • SKOS has been applied to HASSET
  • A system for the re-application of SKOS to other hierarchies has been established internally
  • The texts have been prepared for the automated indexing case study and two corpora (catalogue records and SQB questionnaires) have already been automatically indexed
  • The gold standard of manual indexing of questions is taking place (almost 21,000 questions have been indexed so far)
  • An evaluation timetable, incorporating both automatic and manual evaluation of the automatic indexing results, has been drawn up
  • The research on licences is well under way and the licensing report is expected soon
  • Initial requirements for user and management interfaces have been drafted
  • The project has been promoted via this blog and at conferences (see earlier blogs!)

Mahmoud presented a thorough review of the data mining work undertaken in Work Package 2: Keyword Indexing with a SKOS Version of HASSET Thesaurus.  Mahmoud’s presentation described the work to apply the automatic indexing to the four corpora which we are targeting (catalogue records, SQB questionnaires, full text case studies and support guides and questions/variables taken from Nesstar).

Many interesting questions were fielded, and suggestions given about how to extend work in this area, post-project.  A short debate was held about whether it would be possible to apply automatic indexing to the data within the Archive collections; however, it was concluded that questions of copyright, disclosure and data protection would not permit this.  Good ideas were also received from colleagues in other faculties regarding ways of extending HASSET via suggestions of new terms provided via automatic indexing.  These will be further examined as part of the sustainability work of the project.

Posted in Communication, Project Management, Text mining

UKDA Keyword Indexing with a SKOS Version of HASSET Thesaurus

Mahmoud El-Haj

1. Introduction

Searching data collections has become a common and important task for computer users. Online search engines now provide satisfactory results for general queries, but domain-specific search remains a challenge: selecting accurate search terms is the main problem for domain-specific automatic search. One solution is manual indexing by professional indexers or, sometimes, by authors, who select the keywords that will provide accurate and quick content-based access to the data collection. With the increasing amount of information available on the internet and the rapid growth of domain-specific intranets, however, manual indexing is becoming slower and more expensive. Automatic indexing, on the other hand, is faster and cheaper, and can process vast amounts of information in a feasible time and with minimal effort.

Our project is examining both the efficiency and the accuracy of keyword automation. We are testing the capacity and quality of automatic indexing using a controlled vocabulary (thesaurus) called HASSET (Humanities and Social Science Electronic Thesaurus). HASSET is taken as the vocabulary source for the automatic indexing task, which is being applied to the UK Data Archive’s collection. The automatic indexing will provide a ranked list of candidate keywords to the human expert for final decision-making. Its accuracy or effectiveness will be measured by the degree of overlap between the automated indexing decisions and those originally made by the human indexer (the gold standard). This blog post describes the work we have undertaken so far in the automatic indexing task; further posts will be published once our results are ready.

2. Problem and Motivation

For many years, the indexing of the Archive’s collection has been done manually by professional human indexers. The process begins with the indexer going through the study documentation, usually the questionnaire or interview schedule, though accompanying quantitative data files are also checked for derived variables not covered in the questionnaire. Keywords that represent the topics covered by the study are chosen and their best match is selected from the HASSET thesaurus. Attention is paid to terms used over time within data series and across similar studies to ensure consistency of subject/keyword coverage within the collection. The processing time varies considerably depending on the size and complexity of the study. The keywords selected from HASSET stand as the high-quality (gold) standard that can be used as training data for automatic indexers and as a comparator. As a result of the indexing process, new keywords may be recommended for addition to HASSET if needed.

3. Manual and Automatic Indexing

Indexing has long been used as an efficient means of accessing information (e.g. indexes in books, telephone directories, libraries, etc.). It can be done either manually or automatically. Manual indexing is a time-consuming process; automatic indexing, therefore, has been widely used in experimental systems such as information retrieval and information extraction [Croft et al. 2009]. In recent years, automatic indexing has been shown to be more labour-saving, efficient and consistent than manual indexing [White et al. 2012, Hliaoutakis et al. 2006].

As shown in Figure 1, the automatic indexing process starts by automatically selecting documents from the collection (documents that are not yet indexed). During this process, information about the document is sent to the index (i.e. database), including the document’s title, size, location and genre, if present. The selected documents are split into sentences using delimiters (e.g. full stop, question mark and exclamation mark), and those sentences are in turn split into tokens (i.e. words) based on delimiters (e.g. white space). The tokens are indexed and information about each token’s location, position, frequency and weight is recorded. In our work the indexing process uses a controlled vocabulary. Controlled vocabularies mandate the use of predefined, authorised terms that have been preselected by the designer of the vocabulary (i.e. HASSET), in contrast to natural language vocabularies, where there is no restriction on the vocabulary.
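As an illustration, the sentence- and token-splitting pipeline described above can be sketched in a few lines of Python (a simplified model, not the Archive’s implementation; the inverted-index structure shown is hypothetical):

```python
import re

def build_index(documents):
    """Build a simple inverted index recording each token's
    document, positions and frequency, as described above."""
    index = {}
    for doc_id, text in documents.items():
        # Split into sentences on full stops, question and exclamation marks
        sentences = re.split(r"[.?!]+", text)
        position = 0
        for sentence in sentences:
            # Split each sentence into tokens on white space
            for token in sentence.lower().split():
                entry = index.setdefault(token, {})
                posting = entry.setdefault(doc_id, {"positions": [], "frequency": 0})
                posting["positions"].append(position)
                posting["frequency"] += 1
                position += 1
    return index

index = build_index({"doc1": "Social science data. Data archives store data."})
```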

Figure 1 Index Creation Process

4. Data Collection

The data collection used in our work is that of the UK Data Archive. This project will not look at the quantitative data collections themselves but rather at the contextual textual documents that accompany them. The collection provides documents and their gold-standard keywords to be used for training and testing the automatic indexer. The material includes:

  • A bank of variables/questions (individual variables from Nesstar[1] indexed, each with HASSET terms specific to themselves).
  • Survey Question Bank (SQB) questionnaires[2].
  • ESDS data catalogue records[3].
    • Abstracts (from all catalogue records).
    • Full catalogue records (dating from 2005).
  • Other full-text documents.
    • Case studies[4].
    • Support guides[5] .

5. Data Collection Pre-processing

Automatic indexing requires a number of pre-processing steps before the process can start: document formatting and metadata extraction. The first step converts the data collection from Portable Document Format (PDF) to unstructured text. This is done by running a PowerShell wrapper script around the open-source “Xpdf” software[6]. The script uses the “pdftotext” converter to crawl through the data collection and convert the content of the PDF files into text format (“.txt”).

The second phase extracts metadata from the PDF data collection, using the “pdfinfo” tool from “Xpdf”. We have used pdfinfo as part of a PowerShell script called “extractTAGS.ps1” to extract the metadata tags and store them in txt and xml file formats. The script extracts metadata tags such as the document’s title, keywords, producer, and modification and creation dates. In our work we are mainly interested in the keywords attached to the data collection, as these represent the gold standard. The extracted keywords are saved in a “.key” file, preserving the original file name as with the “.txt” files. The script extractTAGS.ps1 was written as part of this work. Figure 2 shows the overall automatic indexing process. Figure 3 shows the system deployment process (all of the activities that make a software system available for use) of applying automatic indexing where no training and evaluation are required. Both the overall and the deployment processes start by getting the PDF files and converting them into text. Extracting the manual keywords is needed for the automatic and manual evaluation steps, as in Figure 2. Automatic indexing refers to the selection of keywords from a document by a computer to be used as index entries (see Section 3).
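The metadata-extraction phase can be illustrated with a short sketch that parses the “Field: value” lines which pdfinfo prints (a simplified stand-in for the project’s extractTAGS.ps1 PowerShell script; the semicolon-separated Keywords format and the sample values are assumptions for illustration):

```python
def parse_pdfinfo(output):
    """Parse pdfinfo-style 'Field: value' output into a dict."""
    metadata = {}
    for line in output.splitlines():
        if ":" in line:
            field, _, value = line.partition(":")
            metadata[field.strip()] = value.strip()
    return metadata

def extract_keywords(metadata):
    """Return the gold-standard keywords stored in the Keywords tag,
    assuming they are semicolon-separated."""
    return [k.strip() for k in metadata.get("Keywords", "").split(";") if k.strip()]

sample = "Title: Survey Questionnaire\nKeywords: HOUSING; EMPLOYMENT; INCOME\nPages: 12"
keywords = extract_keywords(parse_pdfinfo(sample))
```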

Figure 2 Automatic Indexing Overall Process

Figure 3 Automatic System Deployment Process

6. Experimental Work
In our work so far we have investigated two text mining techniques to automatically index the data collection. The techniques used are the TF.IDF model and Keyphrase Extraction Algorithm (KEA) [Medelyan and Witten, 2006].
The work began by applying SKOS[7] (Simple Knowledge Organization System) to HASSET.
We have taken SKOS-HASSET and applied it to a selected, representative part of the ESDS collection (see Section 4) to test its automatic indexing capabilities.

6.1 TF.IDF Model
One of the most common models to compute term weight is the tf.idf weighting, which makes use of two factors: the term frequency tf and the inverse document frequency idf [Salton and McGill, 1986]. The weight w of term j in a document i can be computed as:

wij = tfij × idfj

where

idfj = log(n / dfj)

and tfij, the term frequency, is the number of times that term j occurs in document i, n is the number of documents in the collection, and dfj is the document frequency of term j [Grossman and Frieder, 2004].

The information in the index is used in the ranking process, where information about each token’s (or subsequence of tokens’) weight is used to calculate similarity. In general, weights reflect the relative importance of the extracted tokens in documents, based on factors such as the token’s term frequency.
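A minimal sketch of this computation, using wij = tfij × log(n / dfj) with no smoothing or normalisation (toy documents for illustration):

```python
import math

def tfidf_weights(documents):
    """Compute w_ij = tf_ij * log(n / df_j) for every term in every document."""
    n = len(documents)
    tokenised = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    # Document frequency: number of documents containing each term
    df = {}
    for tokens in tokenised.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for doc_id, tokens in tokenised.items():
        for term in set(tokens):
            tf = tokens.count(term)
            weights[(doc_id, term)] = tf * math.log(n / df[term])
    return weights

w = tfidf_weights({
    "d1": "housing survey housing",
    "d2": "employment survey",
})
```

Note that a term occurring in every document (here “survey”) gets weight 0, which is exactly the “low domain-specific information content” behaviour discussed below for common words.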

We used the tf.idf model to extract keywords from the SQB questionnaires (see Section 4), processing 2436 SQB documents. We did not use any training data, nor did we use a controlled vocabulary, as tf.idf is a basic model that does not require training; this was in fact the reason for choosing it. We asked the system to extract keywords, limiting the number of subsequent words in each keyword to three (e.g. “British Conservative Party”). Although HASSET terms can contain up to five words (e.g. “LIBRARY, INFORMATION AND ARCHIVAL SCIENCES EDUCATION”), we were not able to process more than three subsequent words, as this would have resulted in an infeasibly long process.

Indexing a domain-specific data collection (i.e. economic and social science) using the tf.idf model resulted in extracting keywords that carry low domain-specific information content. When mapping the extracted keywords to HASSET terms we found very few matches. For example, the tf.idf system failed to find a match for the keyword “Liberal Party”, although it exists in HASSET in different forms: “BRITISH LIBERAL PARTY” and “LIBERAL PARTY (GREAT BRITAIN)”. To overcome this problem we turned to domain-specific keyword indexing with a controlled vocabulary, using the HASSET thesaurus; this way the process extracts only keywords that exist in the thesaurus. Another way round the problem would be to match sub-parts of HASSET terms, or to ignore the qualifier of HASSET terms (i.e. the part in brackets).
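The suggested workaround of ignoring qualifiers and matching sub-parts of terms can be sketched as follows (hypothetical helper functions; the HASSET terms shown are the examples quoted in the text):

```python
import re

def strip_qualifier(term):
    """Remove a bracketed qualifier, e.g. the '(GREAT BRITAIN)' part."""
    return re.sub(r"\s*\([^)]*\)", "", term).strip()

def match_term(extracted, hasset_terms):
    """Match an extracted keyword against HASSET terms, ignoring case
    and qualifiers and allowing it to be a sub-part of a longer term."""
    target = extracted.upper()
    for term in hasset_terms:
        stripped = strip_qualifier(term)
        if target == stripped or target in stripped:
            return term
    return None

hasset = ["BRITISH LIBERAL PARTY", "LIBERAL PARTY (GREAT BRITAIN)"]
matched = match_term("Liberal Party", hasset)
```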

6.2 Keywords Indexing using Controlled Vocabulary

Indexing documents requires selecting the keywords from the controlled vocabulary that best describe a document. Automatic indexing requires training data consisting of documents and their associated keywords. The training data are used to build a classifier model for each keyword; the model is then applied to new (previously unseen) documents to assign keywords. The training data set is chosen based on the percentage of keyword coverage when compared with HASSET terms.

In our work we used the Keyphrase Extraction Algorithm (KEA). Figure 4 illustrates the keyword extraction process, which contains two stages: (1) training and (2) keyword extraction (testing). Figure 5 shows the system deployment process, where no training and testing are needed. The models created during the training phase are used in the extraction phase; the testing data are now the actual data that need to be automatically indexed. The algorithm is based on machine learning and works in two main steps: candidate term identification, which identifies thesaurus terms that relate to the document’s content; and filtering, which uses a learned model (built from our training data) to identify the most significant keywords based on certain properties or “features”, which include the tf.idf measure, the first occurrence of a keyword in a document, the length of the keyword (e.g. two words) and node degree (the number of keywords that are semantically related to a candidate keyword) [Witten et al., 2005]. The machine learning scheme first builds a model using training documents with known keywords, and then uses the model to find keywords in new documents.

KEA uses the latest version of the Weka machine learning workbench, which contains a collection of visualisation tools and algorithms for data analysis and predictive modelling [Witten and Frank, 2000]. Weka supports several standard data mining tasks, specifically data pre-processing, clustering, classification, regression, visualisation and feature selection. Both KEA and Weka are open-source software under the GNU General Public Licence. In our work we rely on Weka’s feature selection functionality, which is used by KEA to measure the importance of a certain keyword.

Figure 4 Keywords Extraction Process using KEA

Figure 5 Keywords Extraction System Deployment using KEA

6.2.1 KEA Algorithm

The keyword extraction process has two stages [Witten et al., 2005]:

1- Training: create a model for identifying keywords, using training documents where the human indexer’s keywords are known. We use the UK Data Archive’s questionnaires (SQBs). We perform a number of runs, trying different training sets (based on the percentage of keyword coverage) in each run. The total number of SQB documents is 2436; we have selected 80% (1948 documents) of the collection for training and the remaining 20% (488 documents) for testing. The controlled vocabulary used is SKOS-HASSET (see Section 6). A common English stop-word list has been augmented with domain-specific terms.

2- Keyword Extraction: choose keywords from a new (test) document.
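The 80/20 split in step 1 can be sketched as follows (a simple deterministic slice for illustration; as noted above, the project chooses its training sets by keyword coverage rather than by position, and the document names here are invented):

```python
def split_collection(doc_ids, train_fraction=0.8):
    """Split a document collection into training and test sets."""
    cutoff = int(len(doc_ids) * train_fraction)
    return doc_ids[:cutoff], doc_ids[cutoff:]

# 2436 SQB documents, as in the text
docs = [f"sqb_{i}" for i in range(2436)]
train, test = split_collection(docs)
```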

Figure 6 shows the training and extraction stages. Both stages choose a set of candidate keywords from the input documents and then calculate the values of the features mentioned earlier. The example shown in the figure is taken from agricultural documents using the Agrovoc[8] thesaurus.

Figure 6 KEA Training and Extraction Processes[9]

Candidate Keywords

KEA chooses candidate keywords in three steps [Witten et al., 2005]:

1- Input Cleaning

· Replace punctuation marks, brackets and numbers with phrase boundaries.

· Remove apostrophes.

· Split hyphenated words into two.

· Remove non-token characters (that do not contain letters).

2- Candidate Identification

· Candidate phrases are limited to a certain maximum length (usually three words); unlike the tf.idf experiment (see Section 6.1) using KEA we set a maximum length of five words.

· Candidate phrases cannot be proper names (i.e. single words that only ever appear with an initial capital).

· Candidate phrases cannot begin or end with a stop-word.

3- Phrase stemming and case-folding (the distinction between upper and lower-case)

Stemming determines the morphological stem of a given inflected word. It uses morphological rules or heuristics to remove affixes from words before indexing. Stemming reduces the number of indexing terms and reduces redundancy (by collapsing together words with related meanings). Stemmers are common elements in information retrieval systems, since a user who runs a query on “computers” might also be interested in documents that contain the word “computer” or “computerization” [Croft et al., 2009]. Stemming is only applied to the document’s terms, where terms with related meanings are grouped together.

Feature Calculation

KEA uses Weka to calculate four features for each candidate phrase, which are used in the training and extraction processes. The features are “TF.IDF”, “First Occurrence”, “Length” and “Node degree”.

· TF.IDF compares the frequency of a phrase’s use in a particular document with the frequency of the phrase in general use (see Section 6.1).

· First Occurrence is calculated as the number of words that precede the phrase’s first appearance, divided by the number of words in the document. The result is a number between 0 and 1 that represents how much of the document precedes the phrase’s first appearance.

· Length of a keyword is the number of its component words.

· Node degree of a candidate keyword is the number of keywords in the candidate set that are semantically related to it. This is computed with the help of the thesaurus. Phrases with a high degree are more likely to be keyphrases.
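The first-occurrence and length features lend themselves to a short illustration (TF.IDF needs collection statistics and node degree needs the thesaurus, so they are omitted; the example document is invented):

```python
def first_occurrence(document_words, phrase_words):
    """Fraction of the document's words preceding the phrase's first
    appearance: a value between 0 and 1, as described above."""
    n = len(phrase_words)
    for i in range(len(document_words) - n + 1):
        if document_words[i:i + n] == phrase_words:
            return i / len(document_words)
    return None  # phrase not present

doc = "survey of housing conditions and housing costs".split()
fo = first_occurrence(doc, ["housing", "conditions"])

# Length is simply the number of component words
length = len(["housing", "conditions"])
```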

Training: building the model

The training stage uses a set of training documents for which the author’s keywords are known. For each training document, candidate phrases are identified and their feature values are calculated as described above. Each phrase is then marked as a keyword or a non-keyword, using the actual keywords that have been manually assigned to the document. This binary feature is the class feature used by the machine learning scheme.

Extraction of new keywords

To select keywords from a new document, KEA determines candidate phrases and feature values, and then applies the model built during training. The model determines the overall probability that each candidate is a keyword, and then a post-processing operation selects the best set of keywords using a Naïve Bayes model (a simple probabilistic model based on applying Bayes’ theorem (from Bayesian statistics) with strong (naive) independence assumptions) [Chen et al. 2009].
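The naive Bayes combination can be sketched as follows (illustrative prior and likelihood values; in KEA these are estimated from the training data):

```python
def keyword_probability(prior_yes, prior_no, likelihoods_yes, likelihoods_no):
    """Naive Bayes: score each class as prior * product of per-feature
    likelihoods (features assumed independent), then normalise."""
    score_yes = prior_yes
    for p in likelihoods_yes:
        score_yes *= p
    score_no = prior_no
    for p in likelihoods_no:
        score_no *= p
    return score_yes / (score_yes + score_no)

# Hypothetical values for one candidate phrase's two features
prob = keyword_probability(0.1, 0.9, [0.8, 0.6], [0.3, 0.4])
```

Candidates are then ranked by this probability and the best-scoring set is selected.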

7 Evaluation

Keyword quality is measured by counting the number of matches between KEA’s output and the keywords that were originally chosen by the professional human indexer. We also use recall and precision to assess the effectiveness of the automatic keyword extraction process.

Precision is the fraction of retrieved instances that are relevant; it can be interpreted as the number of keywords judged correct divided by the total number of keywords assigned. Recall is the fraction of relevant instances that are retrieved; it can be interpreted as the number of correct keywords assigned divided by all keywords deemed relevant to the object. The F1 score (or F-measure, F-score) considers both recall and precision to measure accuracy; it can be interpreted as a weighted average of precision and recall (best value 1, worst 0).
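These three measures can be computed directly from the two keyword sets for a document (standard definitions, with the balanced F1 = 2PR/(P+R); the keyword values are invented):

```python
def evaluate(extracted, gold):
    """Precision, recall and F1 for one document's keyword sets."""
    extracted, gold = set(extracted), set(gold)
    matches = len(extracted & gold)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = evaluate(
    ["HOUSING", "INCOME", "PENSIONS", "TAXATION"],  # automatic output
    ["HOUSING", "INCOME", "EMPLOYMENT"],            # gold standard
)
```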

8 Results

The system was trained using 1948 SQB documents. Table 1 shows the preliminary recall, precision and F1 scores of applying the KEA algorithm to the 488 test SQB documents. Recall and precision were measured by comparing the extracted keywords with the keywords that had been manually assigned to each document.

As the table shows, increasing the number of automatically extracted keywords increases recall but, at the same time, reduces precision.

Table 1 Keyword Extraction Scores for SQB Documents

Number of Keywords Recall Precision F-Measure
10 0.07174 0.13988 0.08498
20 0.11005 0.12151 0.10616
30 0.13500 0.10906 0.11255

Additional evaluation methods, as described in our evaluation plan, will be applied as the next step in our work. We are in the process of cleaning the data, which will, hopefully, help in producing more accurate results. We will also compare our results with those achieved by third parties, where possible and appropriate.

Conclusion

In this work we are testing SKOS-HASSET’s automated indexing capacity. SKOS-HASSET was taken as the terminology source for an automatic indexing tool and applied to question text, abstracts and publications from the Archive’s collection; the results were compared to the gold standard of manual indexing. Limitations of this approach include the small size of the training data. The use of the HASSET vocabulary limits the approach to data that falls within the humanities and social science domains, but this could be addressed by including vocabularies from other domains.

Future blog posts will be published as this work package progresses. This blog post will also form the basis for our automated indexing exemplar / case study.

References

A. Hliaoutakis, K. Zervanou, E. G.M. Petrakis, and E. E. Milios. 2006. Automatic document indexing in large medical collections. In Proceedings of the international workshop on Healthcare information and knowledge management (HIKM ’06). ACM, New York, NY, USA, 1-8. DOI=10.1145/1183568.1183570 http://doi.acm.org/10.1145/1183568.1183570

B. Croft, D. Metzler, and T. Strohman. Search Engines – Information Retrieval in Practice. Pearson Education, 2009. ISBN 978-0-13-136489-9.

D. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. The Kluwer International Series of Information Retrieval. Springer, second edition, 2004.

G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1986. ISBN 0070544840.

H. White, C. Willis, and J. Greenberg. 2012. The HIVE impact: contributing to consistency via automatic indexing. In Proceedings of the 2012 iConference (iConference ’12). ACM, New York, NY, USA, 582-584. DOI=10.1145/2132176.2132297 http://doi.acm.org/10.1145/2132176.2132297

I.H. Witten and E. Frank. Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco, 2000.

I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, and C.G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Y.-L. Theng and S. Foo, editors, Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129–152. Information Science Publishing, London, 2005.

J. Chen, H. Huang, S. Tian, Y. Qu, Feature selection for text classification with Naïve Bayes, Expert Systems with Applications, Volume 36, Issue 3, Part 1, April 2009.

O. Medelyan and I.H. Witten, ‘Thesaurus based automatic keyphrase indexing’, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, June 11-15, 2006, Chapel Hill, NC, USA. DOI=10.1145/1141753.1141819

Posted in Indexing, Text mining

CILIP CIG Conference, 10-11 September 2012, University of Sheffield

Lucy Bell, SKOS-HASSET’s Project Manager, attended the CILIP Cataloguing and Indexing Group’s 2012 conference last week (10th – 11th September).  This gave a good opportunity to network with the librarian community – and in particular indexers who may find a use for SKOS-HASSET.

Lucy presented a paper on ‘Getting metadata to work harder’ and talked about the work of the SKOS-HASSET project, including how the UK Data Archive has used HASSET in the past and how this may change in the future.  The paper was received well and has been followed up since with some useful email conversations (and tweets!).

The conference itself provided some interesting food for thought, highlighting especially how cataloguing workflows in many libraries will need to be more streamlined in the future.  Cataloguers and indexers are looking for tools which will help them work more efficiently but which will still enable them to apply quality control.

Papers were given on automation techniques (such as that from Gary Green of Surrey County Council Libraries) and on cataloguing wider materials (such as that from Helen Williams of the London School of Economics and Political Science, who spoke about cataloguing institutional repository items).  These allowed us to compare our indexing work within the SKOS-HASSET project, and more widely, with that from other libraries and information services.  The LSE’s use of wikis for sharing detailed instructions (and standardised terms) was particularly interesting.

Extensive work was also presented on the new library Resource Description and Access (RDA) standard and on the Functional Requirements for Bibliographic Records (FRBR) model, both of which take a more holistic approach to the application of index terms, promoting the re-use of terms.  This places indexing in a more ontological environment, allowing the relationships between entities, subjects and concepts to be made more explicit and thus permitting more connections to be made.  This seems to tie in neatly with RDF and our work to apply SKOS to HASSET, which in itself will make this tool more flexible.

Lucy Bell

Posted in Communication, SKOS-HASSET

SKOS-HASSET Evaluation Plan

Lorna Balkan and Mahmoud El-Haj

1. Aim of the text-mining task

This blog post will describe our evaluation plan.  The aim of the SKOS-HASSET text mining task is to investigate the potential of SKOS-HASSET by applying it automatically, using a variety of techniques and tools, to the selected contents of the UK Data Archive’s collections and comparing the results against:

(1)    The manually-indexed keyword gold-standards.

(2)    Other organisations’ use of HASSET for automatic indexing, where comparisons can be made and the third parties are willing.

Testers:  Archive indexers and appropriate stakeholders.

General questions we want to answer:

  • How well do the text mining tools we use compare to human indexers?
  • Which text mining tools perform best?
  • Which test collection do the tools work best on (and why)?
  • How does the best automatic indexing tool perform on Archive texts compared to the third party tools on their texts? (Qualitative/opinion-based judgement.)

2. HASSET

The full Humanities and Social Science Electronic Thesaurus (HASSET) consists of 12,233 hierarchically arranged terms, including 7,634 descriptors or preferred terms (which are used for indexing); the rest are synonyms or non-preferred terms.  Geographical terms are excluded from the text mining exercise, leaving a total of 8,830 terms (preferred and non-preferred).

Anticipated problems and challenges:

  • There are a large number of closely related terms in full HASSET, which may make it difficult for a machine learning algorithm to discriminate between them.
  • Differences in level of abstraction: some terms (e.g. STUDENT SOCIOLOGY) are very abstract and are unlikely to appear verbatim in texts, whereas others (e.g. MENTAL HEALTH) are much less abstract and more widely used.
  • Differences in size and format: terms can be single words or multi-word units, and may contain qualifiers (e.g. ADVOCACY (LEGAL)) which may not appear verbatim in texts.
  • Synonymy: different words with identical or very similar meanings. Thesauri control synonymy by choosing one word or term as the preferred term and making its synonyms non-preferred terms. Mapping synonyms to their preferred terms is a challenge for automatic indexers.
  • Polysemy: the coexistence of many possible meanings for a word or phrase. Like other thesauri, HASSET controls polysemy by restricting the meaning of its terms to avoid ambiguity. For example, the meaning of the term “COURTS” in HASSET is restricted to mean “LAW COURTS”, given its position in the hierarchy “ADMINISTRATION OF JUSTICE”.  Polysemy is a challenge for automatic indexing.  Supervised machine learning algorithms with features for disambiguation have been successful in tackling this problem (see Dash 2002, 2005, 2008).
  • Some HASSET terms are used in a very particular sense in the thesaurus (e.g. “SCHOOL-LEAVING” versus “SECONDARY SCHOOL LEAVING”). The scope note for “SCHOOL-LEAVING” says: “USE FOR LEAVING SCHOOL UP TO COMPLETION OF COMPULSORY EDUCATION. FOR SCHOOL LEAVING UPON COMPLETION OF COMPULSORY EDUCATION, USE THE TERM SECONDARY SCHOOL LEAVING”.  Again, this presents a challenge for machines.
  • Plural form: the convention in HASSET is to use the plural form of count nouns (e.g. TOWNS, not TOWN), while both singular and plural forms are found in texts.
  • Spelling variants: many words can have different spellings. The use of -ization versus -isation is an example, as is the use of hyphenation. HASSET terms ending in -ization should match words ending in both -isation and -ization.
  • Some HASSET terms are used chiefly as placeholders in the thesaurus, with a scope note to say use a more specific term instead (e.g. “RESOURCES”, which has the following scope note: “AVAILABLE MEANS OR ASSETS, INCLUDING SOURCES OF ASSISTANCE, SUPPLY, OR SUPPORT (NOTE: USE A MORE SPECIFIC TERM IF POSSIBLE). (ERIC)”). These terms may therefore be assigned more often by the automatic indexer than by the human indexer.
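Several of the surface-level issues above (plural forms, hyphenation, -isation/-ization variants) can be tackled with simple string normalisation before any matching takes place. The sketch below is purely illustrative, in Python rather than any tool the project has settled on, and its plural stripping is deliberately naive:

```python
import re

def normalise(term: str) -> str:
    """Crude normalisation for matching HASSET terms against running text:
    lower-case, split hyphenated forms, strip simple plurals and unify
    -isation/-ization spellings. A rough illustration only; real matching
    would need a proper stemmer or lemmatiser."""
    t = term.lower()
    t = t.replace("-", " ")                 # SCHOOL-LEAVING -> school leaving
    t = re.sub(r"(?<=[a-z])s\b", "", t)     # naive plural stripping: towns -> town
    t = re.sub(r"isation\b", "ization", t)  # unify British/US spelling
    return re.sub(r"\s+", " ", t).strip()

print(normalise("TOWNS"))           # town
print(normalise("ORGANISATIONS"))   # organization
```

Even this level of normalisation lets TOWNS match “town” in a text and maps both spellings of “organisation(s)” to one form; the harder problems listed above (abstraction, synonymy, polysemy) need the machine learning techniques discussed later.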

3. Corpora

We are currently in the process of preparing and describing the corpora we intend to use for the automatic indexing process. The material includes:

  1. The bank of variables/questions (individual variables indexed, each with HASSET terms specific to themselves).
  2. Survey Question Bank (SQB) questionnaires.
  3. ESDS data catalogue records:
    1. abstracts (from all catalogue records).
    2. full catalogue records (from Study Number 5000 onwards: these are the most recent catalogue records, dating from 2005).
  4. Other full-text documents:
    1. case studies.
    2. support guides.
The first corpus (bank of questions) is currently being indexed manually. The fourth corpus (case studies and support guides) has been indexed using UK Data Archive subject categories, which need to be mapped to HASSET terms. This work is also ongoing.  Corpus 3 (catalogue records and documentation) contains HASSET index terms derived from the data and documentation.  Corpus 3a (abstracts) has not been indexed separately from the rest of the data and documentation. The aim here is to see how the terms that the automatic indexer suggests for these records match up with the manually indexed terms applied in relation to the data.  Will it be possible to use documentary evidence and text associated with data to generate effective and useful HASSET terms?

Possible problems and challenges for the text mining task include:

  • The difference in the size of corpora subtypes (and within some subtypes, e.g. case studies).
  • The different number and type of terms assigned to each subtype of corpus (all corpora have been indexed either with the full set of HASSET terms, or with the smaller subset that has been mapped to UK Data Archive subject categories).
  • The large variation in the number of terms assigned within some subtypes of corpus (e.g. catalogue records, where the number of terms assigned ranges from 3 to 468).
  • Some HASSET terms are very commonly used, while others are hardly used; rare terms will be harder to train on.
  • Older documents have been indexed with an older version of HASSET.
  • Some older documents contain OCR’ed files, which will be harder for the automatic indexer to process.
  • Most corpora contain a degree of spelling errors.

3.1 Training versus test corpora

For supervised machine learning tasks, each corpus needs to be divided into a training corpus and a test corpus. The automatic indexer is trained on previously indexed material (the training corpus) and then tested on new, unseen test material (the test corpus). Since our corpora are all somewhat different, we have decided to use separate training corpora for each sub-corpus.
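As a minimal sketch of this split (illustrative Python, not the project’s actual tooling), each sub-corpus can be shuffled and divided with a fixed seed so that the experiment is reproducible:

```python
import random

def train_test_split(docs, test_fraction=0.2, seed=42):
    """Split one sub-corpus into a training set and a held-out test set.
    Each sub-corpus gets its own split, as described above; the fixed
    seed makes the split reproducible across runs."""
    rng = random.Random(seed)
    shuffled = docs[:]           # leave the original list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split([f"doc{i}" for i in range(10)])
print(len(train), len(test))     # 8 2
```

The test documents are never shown to the learner during training, so performance on them estimates how the indexer will behave on genuinely new material.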

4. Evaluation criteria

4.1. Overview

There are different ways of evaluating text mining systems.  For example, we can look at ‘usability’ testing, which covers the functionality, reliability, usability, efficiency and maintainability of a system and is concerned with its usefulness to its users (see Ananiadou et al. 2010).

In this project we are only concerned with evaluating accuracy, however, which ‘only tells us how well the system can perform the tasks that it is designed to perform, without considering methods of user interaction’ (Ananiadou et al. 2010).   We are not creating a tool or user interface for applying terms automatically, but rather testing some functionality. Any evaluation of user interaction is out of scope.

In terms of accuracy, we can evaluate either the accuracy of the system in the indexing task, or, alternatively, the accuracy of the system in the information retrieval task.  In this project, we are only interested in accuracy of the system with regard to the indexing task.

The accuracy or effectiveness of a classifier will be measured by the degree of overlap between the automated classification decisions and those originally made by the human indexer (the ‘gold standard’).

A classifier can make either a ‘hard’ classification decision (i.e. take a binary decision, where a keyword is assessed as either relevant to a document or not) or a ‘soft’ classification decision (i.e. assign a keyword a numeric score, e.g. between 0 and 1, that reflects the classifier’s confidence that the keyword is relevant to the document). Hard classification is more appropriate for classifiers that operate without human intervention. Soft classification is more appropriate for systems that rank keywords in terms of their appropriateness to a document, but where a human expert makes the final decision (see Sebastiani 2006).

In our experiments, we assume that the classifier will present a ranked list of candidate terms to the human expert for final decision-making.
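In code terms the soft scores simply need sorting, whereas a hard decision applies a threshold. A small illustrative sketch (the keyword names and scores are invented):

```python
def rank_candidates(scores, top_n=10):
    """Turn a classifier's soft scores (keyword -> confidence in [0, 1])
    into a ranked candidate list for a human indexer to review."""
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

scores = {"MENTAL HEALTH": 0.91, "STUDENTS": 0.40, "LAW COURTS": 0.73}
print(rank_candidates(scores, top_n=2))   # ['MENTAL HEALTH', 'LAW COURTS']

# A 'hard' decision, by contrast, applies a fixed threshold:
hard = {k for k, s in scores.items() if s >= 0.5}
```

Presenting a ranked list keeps the human expert in the loop while still saving indexing effort.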

We use the following metrics to measure performance:

  • Precision:  the fraction of retrieved instances that are relevant. Precision can be interpreted as the number of keywords judged as correct divided by the total number of keywords assigned.
  • Recall:  the fraction of relevant instances that are retrieved. Recall can be interpreted as the number of correct keywords assigned, divided by all keywords deemed relevant to the object.
  • F1 score (or F-measure, F-score):  the F1 score combines recall and precision into a single measure of accuracy. It is the harmonic mean of the two, F1 = 2 × (precision × recall) / (precision + recall), with best value 1 and worst value 0.
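For keyword indexing these metrics reduce to simple set arithmetic over the automatically assigned terms and the gold-standard terms. A self-contained sketch (exact term matching assumed, which sidesteps the ‘how close is close enough’ question we are still considering):

```python
def precision_recall_f1(assigned, gold):
    """Compare automatically assigned keywords against the manually
    indexed gold standard, treating only exact matches as correct."""
    assigned, gold = set(assigned), set(gold)
    tp = len(assigned & gold)                       # correctly assigned keywords
    precision = tp / len(assigned) if assigned else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = precision_recall_f1(
    assigned={"MENTAL HEALTH", "TOWNS", "RESOURCES"},
    gold={"MENTAL HEALTH", "TOWNS", "HOUSING", "POVERTY"})
print(round(p, 3), round(r, 3), round(f, 3))   # 0.667 0.5 0.571
```

Here two of the three assigned terms are correct (precision 2/3), but only two of the four gold-standard terms were found (recall 1/2), and the F1 score sits between the two.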

4.2 Statistical significance testing

The purpose of statistical significance testing is ‘to help us gather evidence of the extent to which the results returned by an evaluation metric are representative of the general behaviour of our classifiers’ (see Japkowicz 2011).  In other words, can the observed results be attributed to real characteristics of the classifiers under scrutiny, or could they have arisen by chance?

The t-test assesses whether the means of two groups are statistically different from each other (see Fisher 1925). This analysis is appropriate whenever you want to compare the means of two groups. In our work on SKOS-HASSET, where appropriate, we determine any significant differences by performing pairwise t-tests (p < 0.05) using the R statistics package. At this threshold, five times out of a hundred we would find a statistically significant difference between the means even if there were none.
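The project runs its t-tests in R, but the statistic itself is straightforward. The pure-Python sketch below (with invented F1 scores) computes the paired t value, which would then be compared against the t distribution with n − 1 degrees of freedom to obtain the p-value:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic for two sets of per-document scores
    (e.g. F1 from two classifiers on the same test documents)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))  # stdev is the sample SD
    return t, n - 1                             # t value, degrees of freedom

f1_tool_a = [0.61, 0.58, 0.70, 0.64, 0.66]     # invented example scores
f1_tool_b = [0.55, 0.52, 0.66, 0.60, 0.59]
t, df = paired_t(f1_tool_a, f1_tool_b)
print(round(t, 2), df)
```

The pairing matters: both tools are scored on the same documents, so the test looks at per-document differences rather than treating the two sets of scores as independent samples.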

We are currently working out the exact details of how to perform the evaluation. Questions we are considering include:

  • Who should evaluate the system output against the gold standard? (The original indexer will provide context and evidence for their decisions; a combination of the indexer plus a third party may be best.)
  • How do we judge precision and recall – in other words, how close does a term need to be to the gold standard?

5. Document preparation and representation

Document preparation involves some or all of the following steps:

  • Convert documents to plain text
  • Apply tokenization:  break the stream of text into words, phrases, symbols, or other meaningful elements called tokens.
  • Remove ‘stop words’ (i.e. tokens/keywords that bear no content, such as articles and prepositions, or whose content is not discriminating for the document collection (e.g. ‘data’ in our experiments)).
  • Apply stemming:  tokens are reduced to their ‘stem’ or root form.  For example “searcher”, “searches”, “searching”, “searched” and “searchable” would all be reduced to “search”. The stems may not be real words – e.g. “computation” might be stemmed to “comput”.  A system that converts a word to its linguistically correct root (“compute” in this case) is called a lemmatiser. In most cases, morphological variants of words have similar semantic interpretations and can be considered equivalent for the purposes of Information Retrieval (IR) applications. Stemming/lemmatising also abstracts away from common spelling variants (e.g. -ization/-isation) and reduces the number of distinct terms needed to represent a set of documents, thus saving storage space and processing time.
  • Transform documents into a vector space, whose dimensions correspond to the terms that occur in the training set.
  • Apply a weight to each term: weights are intended to reflect the importance a word has in determining the semantics of the document it occurs in, and are automatically computed by weighting functions.
  • Apply document length normalisation.
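The first few steps above can be sketched in a few lines. This is an illustration in Python rather than the pipeline the project will actually use, and the stemmer here is deliberately crude (a real pipeline would use a Porter-style stemmer or a lemmatiser):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "data"}
# 'data' is project-specific: it carries no discriminating content in Archive texts

def stem(token):
    """Very crude suffix stripping, for illustration only."""
    for suffix in ("ations", "ation", "ers", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenise
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [stem(t) for t in tokens]                     # stem

print(preprocess("Searching the data of searchers"))     # ['search', 'search']
```

Note how “searching” and “searchers” collapse to the same stem, while the stop words (including the project-specific “data”) disappear entirely.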

Document length normalisation

As we are applying term weighting (term importance in a document), we need to take account of a document’s length. Long documents usually use the same terms repeatedly, so term frequencies may be higher, giving greater weight to non-stop words. Furthermore, long documents contain many different terms, which increases the number of matches between a query and a long document (see Singhal 1996). Document length normalisation overcomes these problems by penalising the term weights of a document in accordance with its length. In our project, document length normalisation will be applied to ensure fair recall, precision and F-measure scores across short and long documents.

One common normalisation technique is cosine normalisation, in which each term weight is divided by the Euclidean length of the document’s weight vector:

Wi(normalised) = Wi / √(W1² + W2² + … + Wn²)

where Wi is the weight of term i in the document. Cosine normalisation counteracts both the higher term frequencies and the larger number of distinct terms found in long documents.
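A direct implementation of cosine normalisation, assuming the term weights for a document are held in a simple list:

```python
from math import sqrt

def cosine_normalise(weights):
    """Divide each term weight by the Euclidean length of the weight
    vector, so that long and short documents become comparable."""
    length = sqrt(sum(w * w for w in weights))
    return [w / length for w in weights] if length else weights

print(cosine_normalise([3.0, 4.0]))   # [0.6, 0.8]
```

After normalisation every document’s weight vector has Euclidean length 1, so a long document’s repeated terms no longer dominate the scores.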

We are currently reviewing different tools and methods for each of the above steps.

6. Text mining Tools/techniques

We are currently reviewing different tools and techniques for performing automatic indexing. These include term frequency–inverse document frequency (TF-IDF) weighting and Kea.
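As a reminder of how TF-IDF behaves, here is a minimal sketch over tokenised documents (invented toy data; this is neither Kea nor any specific tool under review):

```python
from collections import Counter
from math import log

def tf_idf(docs):
    """Minimal TF-IDF over tokenised documents (lists of terms).
    Terms frequent within one document but rare across the collection
    score highest."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: (c / len(doc)) * log(n / df[t])
                         for t, c in tf.items()})
    return weighted

docs = [["mental", "health", "survey"],
        ["housing", "survey"],
        ["mental", "health", "policy"]]
w = tf_idf(docs)
# 'survey' appears in 2 of the 3 documents, so it scores lower
# than the rarer term 'housing' in the second document
```

Terms common to most of the collection are down-weighted, which is exactly the property that makes TF-IDF a useful baseline for suggesting discriminating index terms.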

7. Experiments and evaluation methodology

Our supervised learning experiments involve a number of steps:

  1. Pre-process text.
  2. Extract terms.
  3. Map terms to HASSET.
  4. Compare results with gold standard.
  5. Tune parameters to maximise precision and recall.
  6. Compare the tuned results to the gold standard again.

8. Presenting the results

We are currently reviewing the literature for ways in which to present our results. Hliaoutakis (2009), for example, describes a comparative study of three systems using precision and recall. Steinberger et al. (2012) report correlations between, amongst other things, precision, recall and F1 and the number of stopwords used, document collection size and number of keywords in the thesaurus.

As soon as we have some results to share, another blog post will follow.

References

Ananiadou, S., Thompson, P., Thomas, J., Mu, T., Oliver, S., Rickinson, M., Sasaki, Y., Weissenbacher, D. and McNaught, J. (2010) ‘Supporting the education evidence portal via text mining’, Philos Trans R Soc, 368, pp.3829–3844.

Dash, N.S. (2002) ‘Lexical polysemy in Bengali: a corpus-based study’, PILC Journal of Dravidic Studies, 12(1-2), pp.203–214.

Dash, N.S. (2005) ‘The role of context in sense variation: introducing corpus linguistics in Indian contexts’, Language In India, 5(6), pp.12–32.

Dash, N.S. (2008) ‘Context and contextual word meaning’, Journal of Theoretical Linguistics, 5(2), pp.21–31.

Dryad HIVE Evaluation, https://www.nescent.org/sites/hive/Dryad_HIVE_Evaluation

Fisher, R.A. (1925) Statistical Methods for Research Workers, 1st ed., Edinburgh: Oliver & Boyd. http://psy.ed.asu.edu/~classics/Fisher/Methods/

Funk, M.E., Reid, C.A. and McGoogan, L.S. (1983) ‘Indexing consistency in MEDLINE’, Bull Med Libr Assoc, 71(2), pp.176–183. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC227138/

Hliaoutakis, A. (2009) Automatic term indexing in medical text corpora and its application to consumer health information systems, Master’s thesis, Department of Electronic and Computer Engineering, Technical University of Crete, Greece, December 2009. http://poseidon.library.tuc.gr/artemis/MT2010-0020/MT2010-0020.pdf

Jacquemin, C. and Daille, B. (2002) ‘In Vitro Evaluation of a Program for Machine-Aided Indexing’, Information Processing & Management, 38(6), pp.765–792.

Japkowicz, N. (2011) Performance evaluation for learning algorithms, online tutorial. http://www.site.uottawa.ca/~nat/Talks/ICMLA-Tutorial.pptx

Névéol, A., Zeng, K. and Bodenreider, O. (2006) ‘Besides Precision & Recall: Exploring Alternative Approaches to Evaluating an Automatic Indexing Tool for MEDLINE’, AMIA Annual Symposium Proceedings 2006, pp.589–593. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839480/

Pouliquen, B., Steinberger, R. and Ignat, C. (2003) ‘Automatic annotation of multilingual text collections with a conceptual thesaurus’, Proceedings of the Workshop Ontologies and Information Extraction at EUROLAN 2003. http://arxiv.org/ftp/cs/papers/0609/0609059.pdf

Sebastiani, F. (1999) ‘A tutorial on automated text categorisation’, in A. Amandi and A. Zunino (eds.), Proceedings of the 1st Argentinian Symposium on Artificial Intelligence (ASAI 1999), Buenos Aires.

Sebastiani, F. (2006) ‘Classification of text, automatic’, in K. Brown (ed.) Encyclopedia of Language and Linguistics, 2nd ed., 2, pp.457–462, Oxford: Elsevier.

Singhal, A., Buckley, C. and Mitra, M. (1996) ‘Pivoted document length normalization’, in Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’96), New York: ACM, pp.21–29.

Steinberger, R., Ebrahim, M. and Turchi, M. (2012) ‘JRC EuroVoc Indexer JEX – A freely available multi-label categorisation tool’, LREC conference proceedings 2012.

Van Rijsbergen, C.J. (1979) Information Retrieval, 2nd ed., Newton, MA: Butterworth-Heinemann.

Posted in Evaluation

Risks and budget

This blog will contain our analysis of any risks we think we might encounter and the budget we have available.

Risk analysis

Risk                               Probability   Severity   Risk score
Financial                          1             5          5
Legal                              2             5          10
Staff retention                    2             5          10
Underestimation of time required   2             5          10

We have contingencies in place should any of these risks materialise. For instance, we would engage the legal expertise of the University of Essex and JISC should any legal issues arise.  If staff leave, the timeline will be reassessed and new staff recruited, if time and resources permit.  The Project Manager is also monitoring tasks in line with the Gantt chart and will communicate any slippage in a timely way.

Budget

The project’s total budget is £97,893, with £69,999 (72%) being funded by JISC.  The budget breaks down as follows:

[Image: SKOS-HASSET budget breakdown]

Lucy Bell

Posted in Project Management