SKOS-HASSET webinar, broadcast 28 March 2013

A webinar describing the work of the SKOS-HASSET project and showcasing its results was broadcast on Thursday 28 March 2013.  The webinar was recorded and the recording, together with the static slides, has been published on the UK Data Archive’s SKOS-HASSET pages.  Please do take a look.

The SKOS-HASSET project has now finished.  Thank you to everyone who contributed to it.

Lucy Bell

SKOS-HASSET Project Manager

Technical objectives and deliverables

Lucy Bell

Introduction

The SKOS-HASSET project had several technical objectives and deliverables:

  1. to create SKOS-HASSET by applying RDF to an existing, well-respected and well-used thesaurus (HASSET)
  2. to bring HASSET and ELSST into a single framework at database level
  3. to improve and update HASSET’s online user-facing webpages, hosting SKOS-HASSET and using open source technologies wherever possible
  4. to extend ELSST’s online management interface (http://elsst.esds.ac.uk/) to facilitate the release of new versions of the thesaurus products

Following agreement from the JISC in October 2012, the second and fourth of these original technical objectives were refined and amended.  Two wide-ranging objectives in fact became three more specific ones.  This was done in response to a changed requirement landscape and to pave the way for further, more in-depth development work, for which we have received additional funding.

Rather than bringing both HASSET and ELSST together on a single platform and tweaking the ELSST management interface, it was agreed to:

  1. test the alignment of the HASSET and ELSST hierarchies by injecting HASSET terms into ELSST and testing that the combined hierarchies work
  2. establish a version control system for new releases
  3. release a new version of ELSST, testing the mechanism

These actions will provide us with a good, solid base on which we can entirely re-imagine the management interface and underlying data, rather than tweaking an existing system.

SKOS-HASSET

SKOS-HASSET was created, validated and released online on 26 February 2013.  As documented in a previous blog, we used Pubby as the publication tool.  That blog, by Darren Bell, describes the work undertaken to achieve this objective.

The product is available as genericode, Turtle and RDF.

Web pages

The HASSET web pages have been extended and enhanced with new, SKOS-related information and a browsable version of the thesaurus.  This HASSET browser (in beta at present) obtains its data from WCF REST services, which supply JSON objects obtained from the relevant database queries.  Select boxes have been used to generate the human-browsable structure for HASSET.  Initially, a proof of concept was set up using ASP.NET Web Forms.  This was further developed to enhance the users’ experience, by allowing searches within the terms, while also protecting the Archive’s intellectual property.  These new and updated pages were released on 27 March 2013.  Feedback from users is welcomed.

An online licence form for requests to download and use the entire thesaurus is also being developed.  We expect to release this within 2013.

Alignment of hierarchies

Information development and technical development work combined to achieve this objective.

Our project officers, Lorna Balkan and Suzanne Barbalet, compared all the hierarchies within HASSET and ELSST.  Those which differ were thoroughly investigated, with all the history and log files consulted and the extent of the issues identified.  The following results were found:

  • terms that are in HASSET but not in ELSST:
    the majority of these will remain but will not be deemed to be ‘core’ terms;
    those considered to have international applicability have been added to the ELSST comments file for discussion with CESSDA colleagues
  • terms that are in ELSST but not in HASSET:
    these were more crucial as they could have skewed the ‘core’ hierarchies;
    the majority of these were methodological terms; however, a small number were concepts that had been deleted from HASSET (but not yet from ELSST) in order to maintain currency and relevance of the thesaurus.  After investigation and consultation with European colleagues, it was decided that these terms should in fact be proposed as deletions from both products.  This will require official international agreement; to expedite this these terms have been added to the ELSST comments file as suggestions for deletion.

Technical systems have been established to monitor any differences between the two products, using SQL Server Reporting Services.  Ten reports have been set up, with alerts, to check that the hierarchies remain in alignment from now until their inclusion in a single application.

Additionally, systems have been established at the database level to identify all terms shared between the two products (known as ‘core’ terms).
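The comparison described in this section comes down to set operations over the two term lists. The following is a minimal sketch in Python, not the SQL Server Reporting Services reports actually used; the function name `compare_thesauri` and the sample term lists are assumptions made purely for illustration.

```python
def compare_thesauri(hasset_terms, elsst_terms):
    """Split two collections of preferred terms into shared and product-specific sets."""
    hasset = set(hasset_terms)
    elsst = set(elsst_terms)
    return {
        "core": hasset & elsst,         # terms shared by both products
        "hasset_only": hasset - elsst,  # terms in HASSET but not in ELSST
        "elsst_only": elsst - hasset,   # terms in ELSST but not in HASSET
    }


# Hypothetical sample data: the membership of these terms is invented.
result = compare_thesauri(
    ["EMPLOYMENT", "CHILDBIRTH", "STUDENT EMPLOYMENT"],
    ["EMPLOYMENT", "CHILDBIRTH", "SAMPLING"],
)
print(sorted(result["core"]))
```

A monitoring report would raise an alert whenever `elsst_only` gains an unexpected term, since those are the differences that could skew the ‘core’ hierarchies.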

Version control

A version control system has been established for both HASSET and ELSST.  The following principles are being followed:

  1. All terms are date-stamped
  2. All changes to terms are recorded, no matter how small, and stored in the HASSET history file.  The details of the user who made the changes are also recorded
  3. All version information is available to the project team via a SQL Server Reporting Services dynamic interface
  4. Live versions of the thesaurus products are made available at regular, agreed intervals:
    1. ELSST is released annually, with major increments (1.00, 2.00 etc.); minor increments are not expected, but provision has been made for them in the first year
    2. SKOS-HASSET as an external product is released quarterly, with minor increments and annually as a major increment (1.00, 1.01, 1.02, 1.03, 2.00 etc.)
    3. HASSET is constantly updated and available for use for indexing internally
    4. SKOS-HASSET and ELSST annual version numbers will match
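The numbering scheme above can be sketched as a small helper. This is an illustration only, assuming that version strings always take the form `major.minor` with a two-digit minor part; the function name `next_version` is hypothetical and is not part of the internal release procedure.

```python
def next_version(current, release):
    """Return the version number that follows `current`.

    release="minor" models a quarterly SKOS-HASSET release (e.g. 1.00 -> 1.01);
    release="major" models an annual release (e.g. 1.03 -> 2.00).
    """
    major, minor = (int(part) for part in current.split("."))
    if release == "major":
        return f"{major + 1}.00"
    if release == "minor":
        return f"{major}.{minor + 1:02d}"
    raise ValueError("release must be 'major' or 'minor'")


print(next_version("1.03", "major"))
print(next_version("2.00", "minor"))
```

Because ELSST is released annually with major increments only, calling the helper with `release="major"` for both products keeps their annual version numbers in step.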

In order to test and implement this system, a previously-released version of HASSET was identified and version control applied.  This was version 1.00.  SKOS-HASSET was then released on 26 February 2013, version 1.00.  A second release (version 2.00) was then made on 25 March 2013.

From this point on, the pattern of quarterly releases begins, with the next SKOS-HASSET version due in the second quarter of 2013.  This will be version 2.01.  A formal, internal procedure for managing these releases has been established.

Release of new version of ELSST

All existing ELSST translators and IP owners have been contacted and kept informed of all developments.

A new version of ELSST, including 136 ‘core’ terms agreed to have international applicability, was released on 25 March 2013.  This is version 2.00, bringing all the versions of the thesaurus products in line.  Version control will be applied at the table level.

Conclusion

All our technical objectives have now been completed and we are ready to move forward with our new and improved thesaurus products.  We are looking forward to taking this work further by entirely re-developing the management interfaces, which will give us, our international ELSST colleagues and the users of our thesauri improved and enhanced applications.

Automatic Evaluation Recommendations Report

Lorna Balkan

1 Introduction

This report describes how we applied the automatic indexing tool, KEA (see Witten et al. (2005)), to some of the UK Data Archive’s document collections, provided through the UK Data Service, and how we evaluated the results. Our aims were (1) to see whether KEA could potentially be used in the future at the Archive to aid metadata creation and (2) to develop recommendations for the future use of automatic indexing with an existing thesaurus.

Specifically, we sought to answer the following questions:

  • How well does KEA perform (compared to the gold standard) across a variety of corpora, where these corpora differ with respect to:
    1. genre
    2. topic
    3. total corpus size
    4. document length
    5. indexing type (in terms of number and level)
  • Why is KEA more successful on some document collections than others?

The experiment also revealed ways in which the thesaurus, HASSET, and Archive-internal metadata processes could be improved.

2 Evaluation tools and data

2.1 The document collection

Our initial intention was to use KEA to index the following document collections:

  1. The Nesstar bank of variables/questions
  2. Survey Question Bank (SQB) questionnaires
  3. ESDS data catalogue records
    1. abstracts (from all catalogue records)
    2. full catalogue records (from Study Number 5000 onwards: these are the most recent catalogue records, dating from 2005)
  4. Other full-text documents:
    1. case studies
    2. support guides

Corpus (3) was conflated into one corpus, consisting of all partial catalogue records. This happened because some studies did not have any abstracts, so basing the indexing on abstracts alone was not useful. The full catalogue record contained too many fields that were producing unhelpful index terms, so it was decided to use partial records only, consisting of the following fields:

  • Title
  • Alternative title
  • Series title
  • Abstract
  • Geographical units
  • Observation units

For the purposes of training KEA, all corpora needed to have manual indexes. Collections (2) and (3) had been previously indexed (but note that in Corpus (2) the number of manual index terms was restricted by the size of the PDF’s ‘document properties’ box, and in Corpus (3) manual indexing is based on full documentation, rather than the abstracts). Collection (4) had been indexed at subject category level. During the project, a mapping was defined between the subject categories and HASSET, and these HASSET terms then assigned to this collection (note that these terms are generally high level terms). Collection (1) was manually indexed especially for the project. See Barbalet (2013) for a fuller discussion of the manual indexing.

These four corpora differ in terms of:

  1. genre
  2. topic
  3. total corpus size
  4. document length
  5. indexing type (in terms of number and level)
  6. average number of manual keywords assigned per document

Note that, although corpus (4) consists of different types of document (user guides and case studies), because there were relatively few documents of each type, they were considered to be one corpus as far as training KEA was concerned.

Each document collection was divided into a training dataset (80% of the total number of documents in the collection), and a test dataset (the remaining 20%). KEA was trained on each training dataset separately, and evaluation results reported for that test dataset. Fifty documents from each test dataset were selected at random for manual evaluation (see Section 4 below).
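The partitioning just described can be sketched as follows. This is a minimal illustration: the text states only that the 50 evaluation documents were chosen at random, so shuffling before the 80/20 split is an assumption, as are the function name `split_corpus` and the fixed seed.

```python
import random


def split_corpus(doc_ids, train_fraction=0.8, sample_size=50, seed=0):
    """Return (training set, test set, manual evaluation sample) for one corpus."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)      # assumption: documents are assigned at random
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    # Draw the documents for manual evaluation from the test set only.
    manual_sample = rng.sample(test, min(sample_size, len(test)))
    return train, test, manual_sample


train, test, manual_sample = split_corpus(range(1000))
print(len(train), len(test), len(manual_sample))
```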

2.2 HASSET

HASSET has been described in a previous blog (see Balkan and El Haj (2012)). For the purposes of the evaluation experiment, geographical terms were removed, since they produced too many incorrect index terms.

2.3 Description of KEA

El Haj (2012) has described in detail how KEA works. It is a term extraction system which includes a machine learning component. The algorithm works in two main steps:

(1) candidate term identification, which identifies candidate phrases (n-grams) from the text and maps these to thesaurus terms (in our case, HASSET). Candidate terms that are synonyms of preferred terms are mapped to their preferred terms.

(2) filtering, which uses a learned model (using training data labelled with thesaurus terms) to identify the most significant keywords based on certain properties or “features”, which include the tf.idf measure, the position of the first occurrence of a keyword in a document, the length of keywords and node degree (the number of keywords that are semantically related to a candidate keyword).
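To make the two steps more concrete, the sketch below maps n-grams to preferred terms and computes two of the features named above. It is a simplified illustration and not KEA’s own code: the tiny `THESAURUS` lookup is hypothetical (it borrows the CRIMES/OFFENCES example from Section 7.1), stemming, keyword length and node degree are omitted, and the formulas assumed here are tf.idf = (count / document length) × −log2(document frequency / number of documents) and first occurrence = (position of first match / document length).

```python
import math
import re

# Hypothetical thesaurus fragment: lower-case surface form -> preferred term.
# "crimes" stands for a non-preferred (UF) form that maps to OFFENCES.
THESAURUS = {
    "employment": "EMPLOYMENT",
    "student employment": "STUDENT EMPLOYMENT",
    "offences": "OFFENCES",
    "crimes": "OFFENCES",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def candidates(text, max_len=3):
    """Step 1: find n-grams that match a thesaurus entry.

    Returns the token list and {preferred term: (count, first position)}.
    """
    words = tokenize(text)
    found = {}
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            term = THESAURUS.get(" ".join(words[i:i + n]))
            if term is None:
                continue
            count, first = found.get(term, (0, i))
            found[term] = (count + 1, min(first, i))
    return words, found


def features(text, doc_freq, n_docs):
    """Step 2 (input only): compute tf.idf and first occurrence per candidate."""
    words, found = candidates(text)
    feats = {}
    for term, (count, first) in found.items():
        tf = count / len(words)
        idf = -math.log2(doc_freq.get(term, 1) / n_docs)
        feats[term] = {"tfidf": tf * idf, "first_occurrence": first / len(words)}
    return feats


feats = features(
    "Crimes against students. Student employment and offences.",
    doc_freq={"OFFENCES": 1, "EMPLOYMENT": 2, "STUDENT EMPLOYMENT": 1},
    n_docs=4,
)
for term, values in sorted(feats.items()):
    print(term, values)
```

In KEA these feature values feed the learned model, which ranks the candidates; the sketch stops at feature computation. Note that EMPLOYMENT is proposed alongside STUDENT EMPLOYMENT because it matches a sub-part of the longer phrase, which is the kind of ‘too broad’ candidate discussed in Section 7.1.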

KEA was applied to three corpora (Nesstar questions/variables, SQB questionnaires and case studies/support guides) in stemming mode. The model was then rerun in non-stemming mode on the catalogue records.

For three corpora (SQB questionnaires, catalogue records and case studies/support guides) the system was set to produce up to 30 KEA keywords. For Nesstar, a maximum of 10 KEA keywords were generated, due to the small amount of text and the correspondingly few keywords that had been manually assigned.

3 Evaluation methodology

3.1 Overview

To judge the system’s performance, two types of evaluation were carried out:

  • automatic
  • manual

In the automatic evaluation, KEA-generated keywords were compared with the set of manually assigned keywords (the so-called ‘gold standard’). In addition, a manual evaluation was performed on a subset (50 documents) of the test set, which involved comparing the KEA keywords to the texts they have been used to index (see Section 4 below). The manual evaluation also sought to discover why KEA either failed to find concepts that had been assigned manually (so-called ‘silence’) or suggested incorrect terms (so-called ‘noise’) (see Section 7 below).

3.2 Evaluation metrics

The main evaluation metrics we used were precision, recall and F1-score, defined as follows:

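· Precision = $\frac{\text{number of relevant KEA keywords}}{\text{total number of KEA keywords}}$

· Recall = $\frac{\text{number of relevant KEA keywords}}{\text{total number of relevant keywords}}$

· F1-score = $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$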

For the automatic evaluation, we define a KEA keyword to be ‘relevant’ if it is an exact match with a manual keyword. In manual evaluation, evaluators can judge a keyword to be relevant, even if it is not an exact match (see Section 4.3 below).
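For a single document, the automatic scores can be sketched as below, taking precision as the share of KEA keywords that exactly match a manual keyword and recall as the share of manual keywords that KEA found. The function name and the sample keyword lists are illustrative assumptions, not project data.

```python
def precision_recall_f1(kea_keywords, manual_keywords):
    """Document-level scores where 'relevant' means an exact match."""
    kea, manual = set(kea_keywords), set(manual_keywords)
    relevant = kea & manual
    precision = len(relevant) / len(kea) if kea else 0.0
    recall = len(relevant) / len(manual) if manual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: two of four KEA keywords match the three manual keywords.
p, r, f1 = precision_recall_f1(
    ["EMPLOYMENT", "CHILDBIRTH", "PRICES", "TRUST"],
    ["EMPLOYMENT", "CHILDBIRTH", "WAGES"],
)
print(round(p, 2), round(r, 2), round(f1, 2))
```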

Precision, recall, and F1-scores were calculated on a document level, then aggregated over each document collection. We used an example-based, as opposed to a label-based, approach for our aggregation scores (see Santos et al. (2011) and Madjarov et al. (2012)). The example-based approach sums the evaluation scores for each example (or document) and divides them by the total number of examples, while the label-based approach computes the evaluation score for each label (or keyword) separately and then averages the performance over all labels. Label-based evaluation includes micro-precision, macro-precision, etc. Given the large number of keywords in our collection (potentially, the whole of HASSET), the example-based approach was preferred.

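Formally, example-based precision and recall are calculated as follows (see Santos et al. (2011)):

Precision = $\frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Z_i|}$

Recall = $\frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i|}$

where N is the number of examples in the evaluation set, $Y_i$ is the set of relevant keywords for example i, and $Z_i$ is the set of machine-generated keywords for example i.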

We calculated the Average F1 score by summing the F1 scores of all the documents and dividing by the number of documents.
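The example-based aggregation can be sketched as follows: each document is scored on its own and the scores are then averaged over the collection. The function name and the two invented documents are assumptions for illustration only.

```python
def example_based_scores(documents):
    """documents: list of (relevant keywords, machine-generated keywords) pairs.

    Returns the average precision, recall and F1 over all documents.
    """
    n = len(documents)
    total_p = total_r = total_f1 = 0.0
    for relevant, generated in documents:
        y, z = set(relevant), set(generated)
        overlap = len(y & z)
        p = overlap / len(z) if z else 0.0
        r = overlap / len(y) if y else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        total_p, total_r, total_f1 = total_p + p, total_r + r, total_f1 + f1
    return total_p / n, total_r / n, total_f1 / n


# Invented data: document 1 scores P = R = 0.5; document 2 scores P = 0.25, R = 1.
avg_p, avg_r, avg_f1 = example_based_scores([
    (["A", "B"], ["A", "C"]),
    (["D"], ["D", "E", "F", "G"]),
])
print(avg_p, avg_r, round(avg_f1, 2))
```

Note that the average F1 obtained this way (0.45 in the example) is not the same as the F1 computed from the averaged precision and recall (0.5), which is why the averaging method is stated explicitly.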

4 Manual evaluation

4.1 Overview

In the automatic evaluation KEA-generated keywords were compared with the manually assigned keywords only. In the manual evaluation, evaluators were asked to judge how relevant the automatically-generated keywords were to the text to which they referred. Evaluators could also record, as secondary information, reasons for lack of relevance, and how close the KEA terms were to the set of manual keywords semantically – see Section 4.3 below.

4.2 Evaluators

Two evaluators were assigned to the evaluation task. Evaluator (1), who is an expert indexer, evaluated each corpus (except the user guide/case study corpus) for relevance; evaluator (2) evaluated the user guide/case study corpus for relevance and assigned relatedness scores to the corpora (see 4.3 below). Both evaluators have many years’ experience of working at the UK Data Archive, and have a thorough knowledge of HASSET.

4.3 Stages and protocol

Manual evaluation was performed in two separate stages. First, the evaluator was presented with an evaluation form (see Appendix 5) and asked to read the original text to judge the relevance (or suitability) of each KEA keyword, on a 3-point scale:

How suitable is the KEA term for Information Retrieval?

5. Extremely suitable = should definitely be keyword

2. Partially suitable = redundant[1], or somewhat too narrow or too broad

0. Unsuitable = far too broad, or completely wrong

These scores were used to derive Precision, Recall and F1 scores for the corpora (see Section 3.2 above). The definition of relevance/suitability in our experiment depends on the type of evaluation, as follows:

  • Automatic evaluation: the KEA keyword is considered relevant only if it is an exact match of a manual keyword
  • Manual evaluation:
    • ‘strictly relevant’: the KEA keyword is considered relevant if it is either an exact match of a manual keyword, or rated ‘extremely suitable’ by the evaluator
    • ‘broadly relevant’: the KEA keyword is considered relevant if it is either an exact match of a manual keyword, or rated either ‘extremely suitable’ or ‘partially suitable’ by the evaluator.
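The three definitions of relevance differ only in which evaluator ratings are accepted in addition to exact matches. A minimal sketch, in which the function name, the mode labels and the sample ratings are all invented for illustration:

```python
# Ratings accepted as relevant, on top of exact matches, for each mode.
ACCEPTED_RATINGS = {
    "auto": set(),
    "strict": {"extremely suitable"},
    "broad": {"extremely suitable", "partially suitable"},
}


def relevant_keywords(kea_keywords, manual_keywords, ratings, mode):
    """Return the KEA keywords that count as relevant under the given mode."""
    manual = set(manual_keywords)
    accepted = ACCEPTED_RATINGS[mode]
    return [k for k in kea_keywords if k in manual or ratings.get(k) in accepted]


kea = ["EMPLOYMENT", "STUDENT EMPLOYMENT", "RESOURCES", "WINDOWS"]
manual = ["EMPLOYMENT"]
ratings = {  # hypothetical evaluator judgements
    "STUDENT EMPLOYMENT": "extremely suitable",
    "RESOURCES": "partially suitable",
    "WINDOWS": "unsuitable",
}
for mode in ("auto", "strict", "broad"):
    print(mode, relevant_keywords(kea, manual, ratings, mode))
```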

In the evaluation form presented to the evaluator, the manual keywords are shown in alphabetical order, next to the KEA keywords that appear in the order in which they are ranked by KEA. It is assumed that all exact matches are ‘extremely suitable’, since they have been assigned by professional indexers, so the form is pre-filled to reflect this.

Evaluators were also asked, for keywords that are either ‘partially suitable’ or ‘unsuitable’, to provide the following information:

Reason for lack of suitability of KEA term:

  • Too broad
  • Too narrow
  • Redundant = concept already covered by other terms (that form an associative relationship) in the KEA set
  • Completely wrong

This information is used for the informal error analysis we performed (see Section 7). (Note that it was initially assumed that all redundant terms would be partially suitable, but in the event some were judged to be unsuitable.)

A second stage of evaluation, which was carried out independently of the first stage, sought to establish how closely related the KEA keywords are to the manual keywords, according to the following scale and criteria:

To what extent is the KEA term semantically related to the Gold standard?

5. Totally related (exact match)

4. Closely related: NT, BT or RT to manual keyword

3. Somewhat related: in the same hierarchy as manual keyword

2. Remotely related: related, but not in the same hierarchy as manual keyword

1. Unrelated

The first category (‘exact match’) is computed automatically and pre-filled in the evaluation form[2]. Note that relatedness scores were not calculated for the catalogue record corpus, due to time constraints.
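In the project only the exact-match score was computed automatically; the other scores were assigned by the evaluator. As a rough sketch of how the upper part of the scale could be derived from thesaurus structure, the code below assigns 5, 4 or 3 and returns `None` where the distinction between ‘remotely related’ and ‘unrelated’ would need human judgement. The `RELATIONS` fragment, including the term PART-TIME EMPLOYMENT and every relation in it, is hypothetical.

```python
# Hypothetical thesaurus fragment; all relations are invented for illustration.
RELATIONS = {
    "EMPLOYMENT": {"BT": set(), "NT": {"STUDENT EMPLOYMENT", "PART-TIME EMPLOYMENT"},
                   "RT": set(), "top": "EMPLOYMENT"},
    "STUDENT EMPLOYMENT": {"BT": {"EMPLOYMENT"}, "NT": set(),
                           "RT": set(), "top": "EMPLOYMENT"},
    "PART-TIME EMPLOYMENT": {"BT": {"EMPLOYMENT"}, "NT": set(),
                             "RT": set(), "top": "EMPLOYMENT"},
    "VOTING": {"BT": set(), "NT": set(), "RT": set(), "top": "VOTING"},
}


def relatedness(kea_term, manual_keywords):
    """Return 5, 4 or 3 on the scale above, or None if a human must decide."""
    manual = set(manual_keywords)
    if kea_term in manual:
        return 5  # exact match
    entry = RELATIONS.get(kea_term)
    if entry is None:
        return None
    if (entry["BT"] | entry["NT"] | entry["RT"]) & manual:
        return 4  # NT, BT or RT to a manual keyword
    tops = {RELATIONS[m]["top"] for m in manual if m in RELATIONS}
    if entry["top"] in tops:
        return 3  # same hierarchy as a manual keyword
    return None  # scores 2 and 1 cannot be told apart from structure alone


print(relatedness("STUDENT EMPLOYMENT", ["EMPLOYMENT"]))
print(relatedness("STUDENT EMPLOYMENT", ["PART-TIME EMPLOYMENT"]))
```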

5 Comparison with other approaches

The standard evaluation paradigm for automatic indexing is to automatically compare machine-generated indexes with a gold standard. The problems with this approach are well documented, since the choice of index terms is often very subjective[3]. Approaches to overcoming this problem include pooling the set of index terms suggested by a number of different indexers, and taking either that or their intersection as the gold standard (see for example Pouliquen et al. (2003)). Another common approach is to accept not just exact matches as relevant terms, but those terms that are semantically related to the manual keywords. Semantic relatedness to the manual keywords can be computed automatically based on various criteria, for example closeness of the terms in the thesaurus hierarchy (see for example Medelyan and Witten (2005)) and/or morphological similarity (see for example Zesch and Gurevych (2009)).

A problem with these approaches is that they assume that the closer the match to the gold standard semantically, the more relevant a term will be as a keyword. However, we show (in Section 8.4 below) that this assumption is not always true, as in some of our corpora, some relevant keywords were totally unrelated to those in the gold standard for reasons discussed in Section 2.1 above.

The alternative to automatic evaluation is to perform a manual evaluation. However, this is both costly and time-consuming. In a manual evaluation, evaluators are most often asked to compare the set of machine-generated keywords directly with the source text (see for example Jacquemin et al. (2002), Névéol et al. (2004) and Medline (2002)). While this is a useful approach, it does not capture any relationship between what the machine assigns and what the human assigns as keywords, which may be useful to know.

Our approach aims to capture both relatedness to the gold standard and relevance to the source text.

Many manual evaluations also involve a qualitative analysis of the automatic indexing terms (see for example Abdul and Khoo (1989), Eckert et al. (2008), Clements, and Medline (2002)). We also undertook an error analysis of our results (see Section 7).

6 Limitations of the approach

Our experiment suffered from a number of limitations, due mainly to time constraints:

  • Gold standard:
    • Taking the set of manually assigned keywords as the gold standard is particularly problematic for the following datasets:
      • Catalogue records: manual keywords and KEA keywords have not been used to index the same thing – the manual indexers have indexed the whole documentation, while KEA has been used to index partial catalogue records only, including the abstract. The abstract is, however, often taken from the documentation and is a summary of it.
      • SQB questionnaires: due to space restrictions (see Section 2.1 above) existing manual keywords have been limited in number.
      • Case studies/user guides: a small subset of the thesaurus (mapped from subject categories) has been used for manual indexing, while KEA used the whole thesaurus.
  • The number of keywords in the gold standard does not always match the number of keywords generated by KEA. Where there are more manually-generated keywords than KEA keywords, 100% recall will not be possible. Several researchers have suggested recalculating evaluation scores for different numbers of automatic keywords (e.g. 5, 10, 20) or using a ‘dynamic rank’, i.e. where the number of manually assigned keywords is the same as the number of automatically assigned keywords (see for example Steinberger et al. (2012)). As Steinberger et al. point out, however, this is not helpful for new texts which have no manual terms as a reference.
  • Number of evaluators: Due to time and resource restrictions, only one evaluator was used to evaluate each document – ideally, a number of evaluators would evaluate each document and results averaged across them.
  • Indexer-centred evaluation: This evaluation is very much indexer-centred, since it is designed to investigate whether or not KEA could be a useful tool for indexers. To get a proper estimate of its value to users, a user-centred evaluation would need to be conducted.
  • Evaluation form: A blind evaluation, where the evaluator is unaware whether the keyword has been generated manually or automatically, would make the evaluation less subjective.
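The suggestion noted above, of recalculating scores for different numbers of automatic keywords or using a ‘dynamic rank’, can be sketched as follows. We did not apply this in our experiment; the function names and the sample data are illustrative assumptions.

```python
def precision_at(ranked_kea, manual_keywords, k):
    """Precision over the k highest-ranked KEA keywords (exact matches only)."""
    top = ranked_kea[:k]
    manual = set(manual_keywords)
    return sum(1 for term in top if term in manual) / len(top) if top else 0.0


def dynamic_rank_precision(ranked_kea, manual_keywords):
    """Cut the KEA list at the number of manually assigned keywords."""
    return precision_at(ranked_kea, manual_keywords, len(set(manual_keywords)))


ranked = ["A", "B", "C", "D", "E"]  # invented KEA output, best first
manual = ["A", "C", "Z"]            # invented manual keywords
print(precision_at(ranked, manual, 5))
print(round(dynamic_rank_precision(ranked, manual), 2))
```

As noted above, the dynamic cut-off depends on a manual keyword count, so it cannot be used for new texts that have no manual terms as a reference.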

7 Error analysis

We distinguish between precision errors, where KEA returns incorrect or partially relevant terms, and recall errors, where KEA fails to find relevant terms.

7.1 Precision errors

Reasons for poor precision include cases where the KEA keyword was:

  1. too broad
    E.g. EMPLOYMENT when the text is about “STUDENT EMPLOYMENT”
    This includes cases where the keyword is chiefly used as a placeholder in HASSET, with a scope note advising the use of a more specific term instead:
    E.g. RESOURCES has the following scope note: “AVAILABLE MEANS OR ASSETS, INCLUDING SOURCES OF ASSISTANCE, SUPPLY, OR SUPPORT (NOTE: USE A MORE SPECIFIC TERM IF POSSIBLE). (ERIC)”
  2. too narrow
    This generally only occurred in the case study/support guide corpus.
  3. redundant
    E.g. CHILDBIRTH when PREMATURE BIRTHS is also found
  4. somewhat or completely wrong

Category (4) can be for a variety of reasons, including:

  1. KEA identifies the correct term, but it has a different meaning in HASSET to that in the text because:
    • they are homonyms:
      E.g. KEA retrieves the keyword WINDOWS (meaning features of a house) because it matches “Windows” (Computer software) in the text
    • the term is used idiomatically in the text, but has a literal meaning in HASSET[4]
      E.g. PRICES is used in its literal sense in HASSET, but idiomatically in the text – “Do higher wages come at a price?”
    • the HASSET term is used in a restricted sense (often indicated in a scope note) which is different to the general language usage found in the text
      E.g. the HASSET term WORKPLACE is used to refer to the location of work only
  2. KEA identifies the wrong term because:
    • it fails to distinguish between terms containing qualifiers
      E.g. KEA retrieves CHILDBIRTH (UF: “LABOUR (PREGNANCY)”) instead of “LABOUR (WORK)” because it matches “labour” (meaning ‘work’) in the text
    • it is unable to parse a compound term correctly:
      E.g. KEA retrieves DEVELOPMENT POLICY because “collections development policy” was found in the text (i.e. it matches an incorrect sub-part of the compound term)
    • the errors are due to stemming:
      E.g. NATIONALISM matches “nation” in the text
      E.g. TRUST matches “trusts” in the text
      E.g. ORDINATION matches “co-ordination” in the text
  3. Too many closely related terms in HASSET make it difficult for KEA to discriminate between them. Examples include:
    • the many variants of TRAINING in the thesaurus – FURTHER TRAINING, OCCUPATIONAL TRAINING, EMPLOYER-SPONSORED TRAINING
    • OFFSPRING and CHILDREN: OFFSPRING has the scope note: USE SPECIFICALLY FOR CHILDREN, REGARDLESS OF AGE. NOT TO BE USED AS AN AGE IDENTIFIER. MAY BE USED FOR ADULT CHILDREN, OR FOR QUESTIONS WHERE AGE OF CHILD IS NOT SPECIFIED. TERM CREATED JUNE 2005. PREVIOUSLY THE TERM “CHILDREN” MAY HAVE BEEN USED.
    • CRIMES is a UF of OFFENCES, not CRIME, which is a separate term
  4. The term belongs to a part of the document that should be ignored for indexing purposes (e.g. author names in case studies). Note that KEA gives greater weight to terms the closer they occur to the beginning of the document. This causes a problem in some corpora, e.g. user guides and case studies, which often begin with background text to set the topic in context.

Possible solutions for precision errors:

  • Add terms to stopwords:
    E.g. INFORMATION, DATA, RESEARCH, ANALYSIS, EVALUATION, TESTS
  • Add new UFs to preferred terms
  • Reduce stemming[5]
  • Remove irrelevant parts of the text:
    This would be relatively easy to do for some of our corpora, e.g. case studies
  • Apply word sense disambiguation (WSD) techniques to help identify the correct use of a homonym in HASSET:
    There is some form of context sensitivity in KEA, since the filtering stage (see Section 2.3 above) is based partly on the node degree (the number of keywords that are semantically related to a candidate keyword). Other ways of introducing context sensitivity have been discussed in the literature – see for example Pouliquen et al. (2003) who use the notion of associate terms to help select keywords.

7.2 Recall errors

Reasons for poor recall include:

  1. The concept is in HASSET, but is not recognised by KEA
  2. The concept is not in HASSET so has to be represented by a combination of other terms. Examples include:
    • CHILDHOOD POVERTY is represented by SOCIALLY DISADVANTAGED CHILDREN
    • some methodological terms

Sources of recall errors include:

  1. The HASSET term has a slightly different form from that found in the text:
    E.g. PRICE POLICY is not found although “pricing policy” is in text
  2. The HASSET term is hyphenated, while the term in the text is not:
    E.g. BREAST-FEEDING not found although “breastfeeding” is in text
  3. The HASSET term is too abstract to be found verbatim in the text:
    E.g. LIFE SATISFACTION is not found, although “enjoying life” is in text

In many cases, the source of recall errors is not obvious, and needs further investigation.

Possible solutions for recall errors include:

  • Add new UFs to current HASSET terms
  • Add new preferred terms
  • Add stemming
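The first of these solutions can be illustrated with a simple lookup from surface forms to preferred terms. This is a sketch only: the function names are hypothetical, and whether “breastfeeding” and “pricing policy” should actually be added as UFs is an editorial decision that the code merely assumes.

```python
def build_lookup(entries):
    """entries: {preferred term: [UF forms]} -> {lower-case form: preferred term}."""
    lookup = {}
    for preferred, used_for in entries.items():
        lookup[preferred.lower()] = preferred
        for form in used_for:
            lookup[form.lower()] = preferred
    return lookup


def match(phrase, lookup):
    """Return the preferred term for a phrase found in the text, if any."""
    return lookup.get(phrase.lower())


# Without the extra UFs, the forms found in the text are missed.
before = build_lookup({"BREAST-FEEDING": [], "PRICE POLICY": []})
# With the (assumed) new UFs, the same forms map to their preferred terms.
after = build_lookup({
    "BREAST-FEEDING": ["breastfeeding"],
    "PRICE POLICY": ["pricing policy"],
})
print(match("breastfeeding", before), match("breastfeeding", after))
```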

8 Discussion of the results

8.1 General remarks

The following subsections summarise the results of the evaluation as shown in the appendices below. The results are all based on 50 samples of each test set. It should be borne in mind that the catalogue record results were produced using a different model of KEA, and unlike the other corpora, KEA indexing was based on a subset of the text that was used for the manual indexing, as explained in Section 2.1 above. For this reason, the results are not directly comparable with those of the other corpora.

8.2 Precision, recall and F1

Performance was measured in terms of Precision, Recall and F1. Three different degrees of each were recorded – Auto, Strict and Broad, as shown in Appendix 1:

  • Auto: auto scores are based on exact matches between KEA and manual keywords. They can be computed automatically
  • ‘Strict’: include exact matches and additional KEA keywords rated ‘extremely suitable’ by the evaluator
  • ‘Broad’: include exact matches, and additional KEA keywords rated either ‘extremely suitable’ or ‘partially suitable’ by the evaluator.

Individually, the best performance overall was seen in the SQB corpus, with a broad F1 score of 0.43. Close behind are the Nesstar and case studies/support guide corpora, with c. 0.35 each. Catalogue records had a low F1 score of 0.21. This was to be expected, given that KEA had relatively little text to index from, compared to the manual indexers. This, together with the fact that KEA was applied in non-stemming mode, led to a poor recall score. However, the precision rate for catalogue records was 0.42, which means that the keywords KEA found are very often relevant.

The highest recall score was found in the case study/support guide corpus (0.73). This suggests that KEA could be usefully employed to suggest new relevant terms for this type of corpus.

As expected there was relatively little overlap between KEA keywords and manual keywords (on average KEA found 18.60 keywords per document across the four corpora, of which only 2.33 were exact matches with the manual keywords) – see Appendix 4.2. However, a high percentage of KEA keywords were considered relevant/suitable even if they were not exact matches – 33% for the SQB corpus, with an average of 25% across all four corpora. This suggests that KEA could be a very useful tool for indexers.

It is not clear, from our initial experiments, to what extent the system’s performance is dependent on the number and size of training documents. Our four corpora differ considerably in this respect – the Nesstar collection contains the largest number of documents, but these are very small in size, containing a single question, while the SQB corpus contains the largest documents. The training datasets are 80% of the total number of documents in each corpus. If we assume that the training datasets are also 80% of the size of the entire corpora (which may not hold, since documents are not of a uniform length) then we can conclude that the highest F1 score is reported for the corpus with the largest amount of training data (i.e. the SQB corpus as shown in Appendix 1) rather than that with the largest number of training documents (i.e. the Nesstar corpus). Performance is clearly related not just to the number and size of the training documents but to their associated keywords. The average number of manual keywords assigned per document varies considerably – in the sample shown in Appendix 4.2, the average number of manual keywords varies from 1.63 for Nesstar questions/variables to 62.86 for catalogue records – but it is important to bear in mind differences in the completeness and level of the keywords, as well as their number.

Further investigation is also required to establish the influence of genre on performance. For example, the case studies and support guide collections were considered to be a single corpus for the sake of the experiment, but may exhibit different behaviour if processed and evaluated separately. The support guides, for instance, use more formal language than the case studies, which are aimed at a popular audience, so are unlikely to cause precision errors due to the idiomatic use of language.

Within text types, topic clearly plays a role. For example, support guides on methodology and cataloguing procedures fared less well than the guides on substantive topics like health and employment, since HASSET has few terms to cover the first topics.[6]

8.3 Reasons for partial or lack of suitability of keywords

For three out of the four corpora (Nesstar, SQB and catalogue records), most terms deemed partially rather than extremely suitable were too broad (see Appendix 2). Only in the case of case studies/support guides was a sizeable proportion of partially suitable keywords (c. 50%) deemed to be too narrow.

Across all four corpora unsuitable terms were usually judged to be completely wrong, rather than too broad.[7]

8.4 Relatedness scores

Relatedness measures how close the KEA keywords are to the manual keywords.

Across the three corpora that we rated for relatedness (Nesstar, SQB and case studies/support guides), there is no consistent relationship between the suitability of KEA keywords and their relatedness to the manual keywords. In the case of Nesstar, 100% of keywords that were not exact matches of the manual keywords but deemed ‘extremely suitable’ were either closely or somewhat related to the manual keywords, while with the SQB and case studies/support guides over 50% were either remotely or unrelated to the manual keywords (see Appendix 3).

This suggests that, in the absence of manual evaluation, relatedness of KEA keywords to manual keywords based on their position in the thesaurus could not be used as a good indicator of whether or not they are extremely relevant. A similar situation is true for partially suitable keywords and their relationship with manual keywords.

There could be several reasons for this. First, the manual keywords may exclude important topics, due to time or space limitations (this is particularly true for the SQB and the case study/support guide corpora). Alternatively, or additionally, the KEA keywords may well be closely related to the manual keywords, but this is not reflected in the thesaurus structure, upon which the relatedness definitions are based. An example is VOTING, which is not related to VOTING BEHAVIOUR or VOTING INTENTION in the thesaurus. Examination of cases such as these will be useful when we come to revise the thesaurus.

Conversely, some KEA keywords rated as ‘somewhat related’ because they share the same hierarchy as a manual keyword, are in fact semantically far distant. For example, SUGAR is in the same hierarchy as MEDICAL DRUGS and is thus judged to be somewhat related. This is because their shared hierarchy PRODUCTS is very large.
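The SUGAR example can be sketched in code. The fragment below assumes a thesaurus held as simple child-to-parent (broader term) links; only the facts that SUGAR and MEDICAL DRUGS sit under PRODUCTS, and that VOTING and VOTING BEHAVIOUR are not linked, come from the text. The intermediate terms, the parent terms given for the two voting terms and the function names are invented for illustration.

```python
# Hypothetical fragment of broader-term links (child -> parent).
BROADER = {
    "SUGAR": "FOOD",                             # intermediate level is assumed
    "FOOD": "PRODUCTS",                          # assumed
    "MEDICAL DRUGS": "PRODUCTS",
    "VOTING": "ELECTIONS",                       # assumed parent
    "VOTING BEHAVIOUR": "POLITICAL BEHAVIOUR",   # assumed parent
}


def path_to_top(term, broader=BROADER):
    """Return the list of terms from `term` up to its top term."""
    path = [term]
    while path[-1] in broader and broader[path[-1]] not in path:
        path.append(broader[path[-1]])
    return path


def same_hierarchy(a, b):
    """True when both terms share a top term (the 'somewhat related' test)."""
    return path_to_top(a)[-1] == path_to_top(b)[-1]


def steps_apart(a, b):
    """Count broader-term steps between two terms via their nearest shared ancestor."""
    path_a, path_b = path_to_top(a), path_to_top(b)
    for i, ancestor in enumerate(path_a):
        if ancestor in path_b:
            return i + path_b.index(ancestor)
    return None  # no shared ancestor: different hierarchies


print(same_hierarchy("SUGAR", "MEDICAL DRUGS"), steps_apart("SUGAR", "MEDICAL DRUGS"))
print(same_hierarchy("VOTING", "VOTING BEHAVIOUR"))
```

A measure such as `steps_apart` is one possible way to distinguish near neighbours from terms that merely share a very large hierarchy; it is offered as a suggestion rather than as a description of how relatedness was rated in the evaluation.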

9 Conclusions and recommendations

Our experiments with KEA proved a useful introduction to automatic indexing at the UK Data Archive. The results of our initial investigations are encouraging, and lead us to believe that KEA could provide a useful tool to our indexers. We would however have to conduct user-oriented evaluation to see how the system could be incorporated into our work-flows.

It would also be useful to conduct further experiments with the system to see how we could improve the model and run the system more efficiently (see some of the suggestions we make in Section 7).

Work on KEA has also provided us with useful insights into how we could improve our processing procedures (which we reviewed when we prepared the texts and metadata prior to running KEA) and the thesaurus – for example, our preliminary error analysis highlighted the need for more synonyms and revealed cases where there are too many similar terms.

We make the following recommendations:

  1. KEA is a useful tool for indexers of full text social science materials;
  2. however, KEA would work best as a suggester of new terms, with moderation from a human indexer;
  3. KEA could also be used as a quality assurance tool, to ensure that terms are not overlooked – some terms it suggested that were highly relevant had not been included in the gold standard, manual indexing;
  4. more work is needed to investigate KEA further and to see how it could be incorporated technically, and in terms of process, into ingest systems.

Notes

1. See Barbalet (2013) for a more detailed discussion of redundancy in indexing.

2. Note that the calculation of categories ‘closely related’ and ‘somewhat related’ could also be automated, and this may be implemented at a future date.

3. Indexing of our corpora follows quality control procedures which help address the problem of subjectivity – see Barbalet (2013).

4. Cases like these were only found in the case studies/user guides corpus.

5. Note, however, that while reducing stemming will improve precision, it will have a negative impact on recall, and there is always a trade-off between precision and recall.

6. It turned out also that the Nesstar training dataset contained many duplicated questions, unlike the test sample, which had very few, and this may have had an effect on the results.

7. Note: there were also some cases of terms being unsuitable because they were too narrow, or redundant – neither of these possibilities was envisaged when the experiment was set up, so statistics for these are not reported separately.  See Barbalet (2013) for possible reasons for these error types.

References

Abdul, H. and Khoo, C. (1989) ‘Automatic indexing for medical literature using phrase matching – an exploratory study’, In Health Information: New Directions: Proceedings of the Joint Conference of the Health Libraries Sections of the Australian Library and Information Association and New Zealand Library Association, Auckland, New Zealand. 12-16 November 1989, pp. 164-172.

Balkan, L. and El Haj, M. (2012): SKOS-HASSET evaluation plan, blog.
https://hassetukda.wordpress.com/2012/08/16/skos-hasset-evaluation-plan/

Barbalet, S. (2013) Gold standard indexes for SKOS-HASSET evaluation: a review, blog. https://hassetukda.wordpress.com/

Clements, J. ‘An Evaluation of Automatically Assigned Subject Metadata using AgroTagger and HIVE’ http://aims.fao.org/sites/default/files/files/Clements_FAO_Metadata_Assignment.pdf

Eckert, K., Stuckenschmidt, H. and Pfeffer, M. (2008): Interactive thesaurus assessment for automatic document annotation, in Proceedings of the Fourth International Conference on knowledge capture (k-cap 2007), Whistler, Canada. http://publications.wim.uni-mannheim.de/informatik/lski/Eckert07Thesaurus.pdf

El Haj, M. (2012) UKDA Keyword Indexing with a SKOS Version of HASSET Thesaurus, blog. https://hassetukda.wordpress.com/2012/09/24/ukda-keyword-indexing-with-a-skos-version-of-hasset-thesaurus/

Jacquemin, C., Daille, B., Royaute, J. and Polanco, X (2002): ‘In vitro evaluation of a program for machine-aided indexing’, Information Processing and Management, 38(6): pp. 765-792, http://perso.limsi.fr/jacquemi/FTP/IPM-1354-jacquemin-et-al.pdf

Madjarov, G., Kocev, D., Gjorgjevikj, D. and Džeroski, S. (2012) ‘An extensive experimental comparison of methods for multi-label learning’, Pattern Recognition, doi:10.1016/j.patcog.2012.03.004, http://kt.ijs.si/DragiKocev/wikipage/lib/exe/fetch.php?media=2012pr_ml_comparison.pdf

Manning, C., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval, Cambridge University Press, http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html

Medelyan, O. and Witten, I. (2005) ‘Thesaurus-based index term extraction for agricultural documents’, in: Proceedings of the 6th Agricultural Ontology Service (AOS) workshop at EFITA/WCCA 2005, Vila Real, Portugal. http://www.medelyan.com/files/efita05_index_term_extraction_agriculture.pdf

Medline (2002): A Medline indexing experiment using terms suggested by MTI: A report. http://ii.nlm.nih.gov/resources/ResultsEvaluationReport.pdf

Névéol, A., Soualmia, L.F., Douyère, M., Rogozan, A., Thirion, B. and Darmoni, S.J. (2004) ‘Using CISMeF MeSH ‘‘Encapsulated’’ terminology and a categorization algorithm for health resources’, International Journal of Medical Informatics, 73, pp. 57-64, Elsevier. http://mini.ncbi.nih.gov/CBBresearch/Fellows/Neveol/NeveolIJMI04.pdf

Pouliquen, B., Steinberger, R. and Ignat, C. (2003) Automatic annotation of multilingual text collections with a conceptual thesaurus, Proceedings of the Workshop Ontologies and Information Extraction at EUROLAN 2003, http://arxiv.org/ftp/cs/papers/0609/0609059.pdf

Santos, A., Canuto, A. and Feitosa Neto, A. (2011) ‘A Comparative Analysis of Classification Methods to Multi-label Tasks in Different Application Domains’, Computer Information Systems and Industrial Management Applications, 3, pp.218-227, http://www.mirlabs.org/ijcisim/regular_papers_2011/Paper26.pdf

Spasic, I., Schober, D., Sansone, S-A., Rebholz-Schuhmann, D., Kell, D. and Paton, N. (2008) ‘Facilitating the development of controlled vocabularies for metabolomics technologies with text mining’, BMC Bioinformatics 2008, 9(Suppl 5):S5. http://www.biomedcentral.com/1471-2105/9/S5/S5

Steinberger, R., Ebrahim M., and Turchi, M. (2012) ‘JRC EuroVoc Indexer JEX – A freely available multi-label categorisation tool’, LREC conference proceedings 2012. http://www.lrec-conf.org/proceedings/lrec2012/pdf/875_Paper.pdf

Witten, I.H., Paynter, G.W., Frank, E, Gutwin, C. and Nevill-Manning, C.G. (2005) ‘Kea: Practical automatic keyphrase extraction’, in Y.-L. Theng and S. Foo (eds.) Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, Information Science Publishing, London, pp. 129–152.

Zesch, T. and Gurevych, I. (2009) ‘Approximate Matching for Evaluating Keyphrase Extraction’, In Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing September 2009, pp. 484-489, http://aclweb.org/anthology-new/R/R09/R09-1086.pdf

Appendix 1 Average Precision, Recall and F1 scores

Appendix 2 Reasons for partial or lack of suitability of keywords

2.1 Reasons for partial suitability of keywords

Average percentage per document of partially suitable keywords that are:

Corpus name                   Too broad   Too narrow   Redundant
Nesstar                       85.00%      5.00%        10.00%
SQB                           64.85%      19.61%       12.35%
Cat. records                  84.38%      0.00%        12.50%
Case studies/support guides   50.69%      49.31%       0.00%

2.2 Reasons for unsuitability of keywords

Average percentage per document of unsuitable keywords that are:

Corpus name                   Too broad   Completely wrong
Nesstar                       23.66%      74.19%
SQB                           1.35%       96.38%
Cat. records                  1.27%       96.23%
Case studies/support guides   16.22%      73.71%

Appendix 3 Average relatedness scores

3.1 Average relatedness to manual keywords of extremely suitable keywords that are not exact matches

Corpus name                   Closely related   Somewhat related   Remotely related   Unrelated
Nesstar                       50.00%            50.00%             0.00%              0.00%
SQB                           31.33%            17.71%             11.35%             41.63%
Cat. records
Case studies/support guides   25.19%            14.29%             15.85%             44.63%

3.2 Average relatedness to manual keywords of partially suitable keywords

Corpus name                   Closely related   Somewhat related   Remotely related   Unrelated
Nesstar                       27.50%            42.50%             12.50%             17.50%
SQB                           29.79%            21.19%             9.42%              39.60%
Cat. records
Case studies/support guides   32.37%            7.42%              16.01%             44.20%

Appendix 4 Other statistics

4.1 Average relatedness and suitability

Corpus name                   Average relatedness   Average suitability
Nesstar                       2.12                  1.26
SQB                           2.29                  2.01
Cat. records                                        1.97
Case studies/support guides   1.67                  1.04

4.2 Average number of keywords (manual and Kea) per document

Appendix 5 Example of evaluation form

SKOS HASSET Development Process

John Payne

One of the deliverables of the SKOS HASSET project was to provide the ability to present the HASSET thesaurus as SKOS linked data.  This required several strands of work, including database cleansing, application creation and software configuration.  As described in an earlier blog entry, we chose to deliver SKOS by utilising the open source project Pubby and by converting our SQL Server based HASSET thesaurus into RDF and storing this in a BrightstarDB triple store.  For more information on SKOS, Pubby and triple stores, refer to this post.

At the Data Archive, the Application Development and Maintenance team adopt Agile methodologies wherever possible to see projects through to deployment.  We employed these Agile techniques in the SKOS HASSET project in order to focus and drive development through to delivery of the completed product.

At the Archive, we have used JIRA for three years to manage both issue tracking and development tasks across our complete range of projects – both development and maintenance.  We have a plugin called GreenHopper installed within JIRA, that delivers an Agile presentation layer on top of the issues/tasks and provides configurable Scrum and Kanban views and functionality on top of any project or combination of projects you choose.  The documentation even suggests you combine the two and try Scrumban!

The combination of JIRA and Kanban was used for this sprint to track issues and progress.  JIRA issues contained the tasks, including current status, comments, time logging etc. and we used the Kanban JIRA plugin to give a visual representation of the current state of play and progression of tasks from ‘not started’ through ‘in progress’ to ‘complete’ every morning, rather than using physical post-it notes and a whiteboard.

We decided to use two sprints during the development of SKOS-HASSET, with the second following several weeks after the first.  This is our preferred strategy.  Sprint one had specific goals: laying the groundwork in terms of data quality and producing, internally, a valid SKOS file.  Sprint two tied everything together and addressed any issues that came to light during sprint one.

The process we adopted was:

Sprint Preparation

During sprint preparation, the three developers involved met and picked through the complete list of issues/requirements to familiarise everyone with the task at hand.  These tasks were then created within JIRA and each was assigned to an individual and prioritised.

Sprint

The initial sprint lasted for five days and primarily involved validating and cleaning the data and creating an application to create valid triples from our relational database version of the thesaurus. Every morning of the sprint would involve a short ‘stand-up’ meeting where progress, problems and proposed work for the current day would be briefly described by each developer.  This was backed up visually by using the Kanban view provided by GreenHopper.  All application code created was stored in SVN source control and built from within Jenkins, our continuous integration server, in order to satisfy our coding quality standards.
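To give a flavour of what ‘creating valid triples from a relational database’ involves, here is a minimal sketch in Python. It is not the application we built: the base URI, the row layout and the function names are assumptions made for illustration, and the output is plain N-Triples text rather than anything loaded into a triple store. Only the SKOS and RDF vocabulary URIs are the real W3C ones.

```python
BASE = "http://example.org/hasset/"  # hypothetical base URI for concepts
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical rows as they might come from a thesaurus table:
# (term_id, preferred_label, broader_term_id or None for a top term)
ROWS = [
    (1, "PRODUCTS", None),
    (2, "MEDICAL DRUGS", 1),
]


def term_uri(term_id):
    """Build the N-Triples form of a concept URI from its database id."""
    return f"<{BASE}{term_id}>"


def rows_to_ntriples(rows):
    """Turn thesaurus rows into a list of N-Triples statements."""
    lines = []
    for term_id, label, broader_id in rows:
        subject = term_uri(term_id)
        lines.append(f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .")
        # Escape backslashes and double quotes inside the literal.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{SKOS}prefLabel> "{escaped}"@en .')
        if broader_id is not None:
            parent = term_uri(broader_id)
            # Record the hierarchical link in both directions.
            lines.append(f"{subject} <{SKOS}broader> {parent} .")
            lines.append(f"{parent} <{SKOS}narrower> {subject} .")
    return lines


ntriples = rows_to_ntriples(ROWS)
print("\n".join(ntriples))
```

Keeping the conversion as a pure function from rows to statements makes it straightforward to re-run whenever the source data is cleaned, which matters when data quality work and triple generation happen in the same sprint.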

Post sprint review

In the week following the sprint, the developers met to reflect on what we had achieved and what issues we encountered.  This was also a good opportunity to make sure that both the addressed and remaining issues had been documented and commented upon in readiness for the second sprint.

Sprint 2

Sprint 2 was a smaller, two-day sprint and was the final push to move from our development environment to a production environment ready for external consumption.  The requirements for sprint 2 were not data-related but focused instead on implementing Pubby on a newly set up production server and ensuring that all underlying data was now being created in, and supplied from, the production environment.

Sprint preparation

The developers once again met to discuss the remaining list of issues/requirements.  These were then reassigned in certain instances and reprioritised.  During this second preparation phase, we also tried to resolve any external dependencies that would otherwise hamper the forthcoming sprint such as setting up of domain names and preparing for firewall changes etc.

Sprint

The second sprint was better described as a dash with it being so short!  Most of this sprint involved configuring a new production web server to host Pubby, correctly installing our Triple Store onto its live server and deploying application code and tables from development to production.

Post sprint

It would be lovely to say that after sprint 2, all our issues were closed but this is not quite true.  We still have a couple of small internal loose ends but these either do not directly affect the SKOS HASSET product or they were moved out of the scope of this development cycle. One advantage of using JIRA to manage tasks is that these remaining issues are formally documented and must be commented on, resolved and closed by the project manager before the project is completed.

As I started out by saying, in terms of scale, the SKOS-HASSET development was only small but our decision to adopt the ‘sprinting mind-set’ was a sensible choice.  The Agile techniques of sprinting and having short, stand up morning meetings are insightful and not only deliver information, they act as glue between the team members and provide the focus and impetus to keep momentum going and deliver results in a short timeframe.

SKOS-HASSET Webinar: 28 March 2013, 10:00 – 11:00

The SKOS-HASSET team will be presenting the results of its work on 28 March 2013, at 10:00 GMT via a Webinar.  Please join us to hear more about the project.  Space is limited, so do sign up with GotoWebinar to reserve your place.  After registering you will receive a confirmation email containing information about joining the Webinar.

The webinar will describe the work undertaken in the SKOS-HASSET project, a 10-month, JISC-funded project in the UK Data Archive, University of Essex, which ran from June 2012 until March 2013.  Its aims were to:

1. Apply SKOS to HASSET
2. Improve its online presence
3. Test its automated indexing capacity

Simple Knowledge Organization System (SKOS) is a language designed to represent thesauri and other classification resources.  It encodes these products in a standardised way, using RDF, to make their structures comparable and to facilitate interaction.

The webinar will give an overview of the aims and objectives of the project, the technologies used and the results of the automated indexing work.  For this last piece of work, SKOS-HASSET was taken as the terminology source for an automatic indexing tool (KEA) and applied to question text, variables, abstracts and publications from the Archive’s collection.  The results were compared to the gold standard of humanly-undertaken indexing. These will be presented at the webinar.  SKOS-HASSET itself will also be demonstrated.

Webinar invitation details:

Title: SKOS-HASSET JISC-funded project: results and discussion
Date:    Thursday, March 28, 2013
Time:    10:00 AM – 11:00 AM GMT
Sign up URL:    https://www3.gotomeeting.com/register/897319998

System Requirements
PC-based attendees
Required: Windows® 7, Vista, XP or 2003 Server

Mac®-based attendees
Required: Mac OS® X 10.5 or newer

Mobile attendees
Required: iPhone®, iPad®, Android™ phone or Android tablet

We hope to see you there!

The SKOS-HASSET Team

Gold Standard Indexes for SKOS-HASSET Evaluation: A Review

Suzanne Barbalet

1. Introduction

The gold standard indexes we used for SKOS-HASSET training and evaluation were a combination of our in-house quality controlled indexes and a specially prepared index used for training Kea to index variables. The latter task was performed as exhaustively as possible to enhance training but could not ignore efficiency constraints. All indexes conform to ISO standards.

In-house indexing takes account of:

  • the perceived needs of the users
  • policy issues incorporating plans for the future development of the collection

Indexing data and indexing documents are slightly different processes:

  • Concepts within data are measurements and their definitions may vary
  • Concepts within data-related documentation have a general language definition

Indexing the former type of concept is a two stage process of translating its operationalized definition into a general language definition and from that to a thesaurus definition. The latter is a simpler process requiring some form of mapping between a general language definition and the thesaurus definition. This blog explains these two processes in greater detail and outlines how the indexes were prepared with reference to the functions each corpus was designed to perform.

2. Information retrieval requirements

Funded by ESRC, the JISC and the University of Essex, the UK Data Archive is committed to supporting secondary analysis; that is the re-use of quantitative, qualitative or historical data. Data analysis requires knowledge of and access to specialist statistical software such as SPSS or Stata; thus, the Archive’s users are a special clientele who rely on the Archive both to supply data in an appropriate form and to support their use.

Since Archive users are primarily data analysts, an important access requirement is variable information. Some users will be specialists in a particular data collection, others will have broader interests and require subject rather than variable access. Data analysts will wish to retrieve particular variables of interest or to locate studies of interest; survey managers, teachers of survey methods, social and economic researchers and post graduate students wishing to conduct pre-analysis will search for relevant survey questions and other information in the documentation.

To meet these needs the Archive has enriched its resources through the provision of the Nesstar service, the Survey Question Bank project and by producing Case Studies and Support Guides. These resources can be represented thus:

Fig 1. SKOS-HASSET Test Corpora

3. The Corpora

We used four corpora for testing Kea for the SKOS-HASSET project. They comprised ESDS Data Catalogue Records with existing study level keywords, plus three supplementary corpora:

  • Survey Question Bank PDFs with existing keywords, providing access to documentation the depositor supplies with the dataset
  • Case Studies with existing subject indexes, providing examples of how Archive studies have been used in teaching and research, and Support Guides providing an overview of the data collections and internal procedures
  • Nesstar Bank of Variables/Questions, including only research datasets, with a tailor-made index

These supplementary resources have evolved to provide public access to ‘data-related’ documentation and to view frequencies and make cross tabulations using the Nesstar service.

Procedures for cataloguing and indexing have evolved alongside the development of the Archive’s thesaurus HASSET. Initially HASSET was based on the UNESCO Thesaurus (1977). Now that it is established as a general social science thesaurus, indexing is directed towards providing the best access to variables within the data. Limits are not imposed on the number of index terms permitted per data collection in order to maintain a high standard of information retrieval. Cataloguers select as many new terms as they find necessary, which may vary between 10 and 450 terms. Data indexing tends to incorporate both broad and narrow terms to cover the variation of concept levels that variables may include.

Data are a complex object in information retrieval terms. When concepts are operationalized, the indicators specified will rarely be found in the same thesaurus hierarchy, nor will they necessarily have a lexical relationship. For example, the Office for National Statistics webpage Different Aspects of Ethnicity recommends that the variables ‘Country of birth’, ‘Nationality’, ‘Language spoken at home’, ‘Skin colour’, ‘National/geographical origin’, and ‘Religion’ should together indicate the concept of ‘Ethnicity’. At the same time indicators such as ‘Religion’ may also contribute to the operationalization of other concepts. Providing this information, together with information on data collection methods, data preparation and results or findings, is best practice for data depositors and is key to enabling the secondary user to make informed use of the data. It is information that is incorporated in the catalogue record abstract and subject categories (from the DDI element <topcClas>), which were used for Kea.

All four corpora then reference data collections in the data catalogue, indexed with HASSET terms. Since each corpus was arranged to meet specific user needs, different strategies of indexing were necessary for each corpus and thus training for Kea was done separately.

4. Manual Indexing Practices

Two principles of indexing are worth discussing in the context of providing a gold standard for Kea. They are:

  • Specificity
  • Associative terms

4.1 The Principle of Specificity

The central principle of indexing is to use the most specific term that entirely covers the topic (Lancaster, 1998: 28; de Keyser, 2012: 11; Broughton, 2004: 70). It may be the case that the controlled vocabulary or thesaurus does not include a term at the level of specificity required by a particular resource, in which case either a higher level term is used or a more specific term needs to be added to the thesaurus (Lancaster, 1998:30).

Since the topics of the Case Studies and Support Guides are extremely broad, while a Nesstar Bank of Variables/Questions index requires very specific index terms, the level of specificity of indexing required for each corpus can be represented as below:

Fig 2. Level of Specificity of Indexing required by the content of the SKOS-HASSET Corpora

Case Studies and Support Guide Corpus
Case Studies and Support Guides have been collected together under a broad topic classification scheme and tagged with UK Data Archive Subject Categories. These were mapped to HASSET terms for the SKOS-HASSET project and became our gold standard. The case study “Unemployment and Psychological Well-Being” for example is classified under the topics LABOUR AND EMPLOYMENT + ECONOMICS, both of which are HASSET top terms for very large hierarchies.

Economic and Social Data Service (ESDS)/UK Data Service Catalogue Corpus
The main resource for information retrieval managed by the UK Data Archive is the ESDS (and soon to be UK Data Service) catalogue of studies. This is a union catalogue shared by members of the Service and includes records of a variety of data collections for quantitative analysis, including census data; data suitable for qualitative analysis; and historical research resources.

The aim of indexing a data collection is to retrieve a particular study as well as providing access to the variables available for analysis within the particular study.

While a cardinal rule of indexing is not to employ ‘multiple indexing’ (de Keyser, 2012: 12), that is to introduce redundant terms (Lancaster, 1998:280), a dataset is not one discrete object of information in the same way a document is. Within a data collection the object or objects to be retrieved are variables or questions as well as documents, or support guides, and all need to be accessed together with their associated documents. It is often necessary therefore to use ‘multiple indexing’ or introduce what may appear to be ‘redundant’ terms. This explains why a catalogue record can legitimately contain a high number of top terms together with specific terms, sometimes from the same hierarchy.

An example in this corpus, which also by chance was included in the Survey Question Bank Corpus, was Study 6843: British Gambling Prevalence Survey, 2010. Although it had been indexed with many top terms the British Gambling Prevalence Survey, 2010 had a total of 39 lower level terms among its 53 HASSET keywords in the ESDS Data Catalogue Records.

Survey Question Bank (SQB) PDF Corpus
The third corpus is the Survey Question Bank PDF collection of documents. In 2007 the Archive inherited an organized collection of survey documentation and enriched associated materials in PDF format from the University of Surrey (Qb). Survey documentation is processed at the point of ingest and access is provided on the catalogue metadata page; however, the format in which these materials arrive at the Archive is in no way standardized. A ‘question bank’ addresses this organization problem and, for similar reasons, many data archives have established ‘question banks’.

The SQB PDF proposed terms for HASSET included survey methods terms. These and other non-thesaurus terms that the Archive inherited from the original Qb questionnaires, processed pre-2007 by the Qb team at the University of Surrey, became ‘stop words’. PDF is the format in which the documentation is deposited, and the PDF ‘keyword field’ limits the number of terms that can be entered, which artificially reduced the number of keywords chosen. The selected keywords need to be comparatively specific in this corpus to allow both the retrieval of the document itself and particular questions together with information about innovative survey methodology.

The documentation includes questionnaires, interview instructions, letters, consent forms, interviewer observations, nurse schedules, technical reports and user guides. For the larger surveys a number of questionnaires may have been administered, each of which is indexed. In these cases more specific additional terms may have been required than were in the study list of keywords. In our example of the British Gambling Prevalence Survey, 2010, however, there was only one questionnaire so most of the original gold standard keywords provided by the catalogue record could be used for the SQB PDF document. In the ESDS/UK Data Service Catalogue Record Corpus 14 top terms were used for the study British Gambling Prevalence Survey, 2010, while the SQB PDF corpus used only 2 top terms, which included GAMBLING (the topic of the survey) and EMOTIONAL STATES (a very small hierarchy), reflecting the need for the catalogue record keywords to be broader in scope than the keywords required for particular items of survey documentation.

Nesstar Bank of Variables/Questions Corpus
The final corpus is a database of 26753 indexed variables from 35 research datasets available in the Economic and Social Data Service (ESDS)/UK Data Service Nesstar catalogue. Nesstar is a free online data analysis tool.

For the SKOS-HASSET project 26753 variables were indexed in the period between mid-August and the end of October 2012. These variables were either associated with questions, parts of questions or marked sections of the administered questionnaire. Nesstar indexing was undertaken to locate variables, whereas SQB indexing was undertaken to locate questions.

While variables appear similar to questions, a single question can be translated into multiple variables (for example a question with five possible, exclusive multiple choice answers can result in five variables). Survey datasets are made up of the codes corresponding to variables encoded either by a Blaise program or encoded manually or a combination of both, taking account of responses to survey questions. The codes may be binary or a more complex array. Where the answer to a question could be a ‘yes’/‘no’ response, a variable and question can be the same entity of text. However, where the question requires an open numerical response (for example how many hours it has taken to complete a task), a variable cannot be meaningfully indexed. In addition some variables are derived. Though present in the data, derived variables did not appear in the Nesstar Bank of Variables/Questions Corpus, which avoided a difficulty as challenging to manual indexing as it would be for Kea.
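The way one question can become several variables may be easier to see in a small sketch. This assumes a simple binary (one indicator per answer option) coding scheme; the question identifier, the answer options and the function name are invented for illustration and do not come from any Archive dataset or from Blaise.

```python
def code_response(question_id, answers, chosen):
    """Code one response as a set of binary variables, one per answer option."""
    return {
        f"{question_id}_{position}": int(answer == chosen)
        for position, answer in enumerate(answers, start=1)
    }


# Invented question with five exclusive answer options.
options = ["Daily", "Weekly", "Monthly", "Less often", "Never"]
coded = code_response("Q12", options, chosen="Weekly")
print(coded)
```

Each of the five resulting variables carries only a fragment of the question’s meaning, which is one reason indexing at variable level needs the context of the full questionnaire.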

Though Nesstar does group variables in a logical arrangement when indexing the more complicated Blaise structured questionnaires it is not easy to attach follow-up questions to the source question without referring back to the questionnaire which is a time consuming process. Individual questions out of context of the schedule that delivered them cannot be hand indexed with ease nor with ensured accuracy. In addition a variable may be constructed via a standard measure such as a copyrighted scale. The component parts of the scale are questions but whether these comprise one or ten variables will be a coding decision and for important questionnaire design reasons questions that comprise the scale are not always asked in sequence.

Addressing these problems in the task of manually indexing the Nesstar Corpus meant much more time was required to perform variable indexing than indexing a questionnaire. Indexing by variable led to much repetition of keywords and many keywords per variable to achieve specificity. Not all individual variables were suitable for indexing. The task was made especially difficult by the fact that in the majority of cases Blaise schedules were used for data collection and thus the sequence in which the question is asked is not easy to follow. Although efforts were made not to repeat the indexing process when a question was encountered again, and variables are entered twice when they are found to belong to two or more groups, this was not always possible. As a result, the labour required to index variables individually was multiplied in a similar fashion. These issues increased the work required to undertake the indexing from an estimated 27 person days to 34 person days (fte) work, inclusive of planning and testing evaluation forms; that is from an estimated 1000 files per day down to 800 files per day.
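The change in throughput can be checked with a quick calculation. The Python sketch below is illustrative only; it assumes that each indexed variable corresponds to one file, which the figures quoted above suggest but do not state explicitly.

```python
# Figures quoted in the text above.
VARIABLES_INDEXED = 26753
ESTIMATED_PERSON_DAYS = 27
ACTUAL_PERSON_DAYS = 34


def files_per_day(total_files: int, person_days: int) -> float:
    """Average number of files indexed per person day."""
    return total_files / person_days


# Assumption: one file per indexed variable.
estimated_rate = files_per_day(VARIABLES_INDEXED, ESTIMATED_PERSON_DAYS)
actual_rate = files_per_day(VARIABLES_INDEXED, ACTUAL_PERSON_DAYS)

print(f"Estimated: about {estimated_rate:.0f} files per day")
print(f"Actual: about {actual_rate:.0f} files per day")
```

Under that assumption the rates come out at roughly 991 and 787 files per day, which matches the rounded figures of 1000 and 800.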

An example will be given below with reference to study 5294 to illustrate that indexing at variable level required extra, more specific keyword than the catalogue record list.

4.2 Associative terms

Lancaster (1998: 30) says that, beyond the principle of specificity, no real rules of indexing have been developed, only theories. Nevertheless, he proposes two rules which, when followed, can lead to the use of “associative terms”. They are:

  • Include all the topics known to be of interest to the users of the information service that are treated substantively in the document
  • Index each of these as specifically as the vocabulary of the system allows and the needs or interests of the users warrants (Lancaster, 1998: 31)

“Associative terms” should be distinguished from “related terms”. “Related terms” are terms within the same hierarchy. In indexing practice, assigning them together is generally regarded as “redundancy” or “multiple indexing” and is not recommended. In practice, however, some “multiple indexing” may be required when indexing variables. In evaluation we will look at the context before assigning this category. “Associative terms”, on the other hand, are terms that are not related in the hierarchical structures of a thesaurus (Lancaster, 1998: 15; Broughton, 2006: 129). They are terms used in combination to cover a concept, and may be just as necessary when indexing a variable or question as they are in larger textual contexts. A common example in the ESDS Data Catalogue Records Corpora is the combination of the terms CHILDREN + HEALTH to cover the concept of CHILD HEALTH.

The use of “associative terms” is generally not easy to see in a long list of keywords but is very apparent when the text is as short as it is in the variable files of the Nesstar Corpus. Their use will depend both on how complex the concept may be and also the availability of suitable terms within the thesaurus. There will be a lead term followed by “qualifiers”, each of which is dependent on the previous term (Lancaster, 1998: 56). These “qualifiers” are “context dependent”.

An example in the Nesstar Bank of Variables/Questions Corpus is a variable from study number 5294 Workplace Employee Relations Survey, 2004: Cross-Section Survey, 2004 And Panel Survey, 1998-2004; Wave 2. The question is:

“For each of the above groups of employees, how many are in each of the following occupational groups? Protective and personal services: Full-time females”.

In the questionnaire ‘personal services’ are defined as ‘caring, leisure and other personal service occupations’. The HASSET term SERVICE INDUSTRIES covers ‘leisure and other personal services as well as protective services’. It does not cover ‘caring’. The HASSET term CARE comes with an instruction to choose a more specific term, while CAREGIVERS refers to “non-professionals”, therefore the allocated keywords were kept at the level of specificity illustrated in Figure 3.

Fig 3. Associative HASSET terms to describe a variable in the Workplace Employee Relations Survey, 2004: Cross-Section Survey, 2004 And Panel Survey, 1998-2004; Wave 2 dataset.

In this ‘context-dependent’ combination of ‘associative terms’, though SERVICE INDUSTRIES refers to OCCUPATIONS, the variable is one in a series of questions about ‘occupations’, so it is important to include this term to allow for analysis just at the level of OCCUPATION. In the catalogue record the list of keywords includes OCCUPATIONS, WOMEN and FULL-TIME EMPLOYMENT. It does not include SERVICE INDUSTRIES. Thus indexing for the Nesstar Bank of Variables/Questions Corpus requires a greater level of specificity in the selection of keywords than was required for indexing the ESDS Catalogue Record Corpus.

In this example the question is one of a series on “occupations”. Other connected questions will not necessarily follow in the same sequence. The problem is that of how a variable may be operationalized, as discussed above. Indexing completely at variable level, which the Nesstar Corpus required, will not necessarily group the variables in a convenient manner for the indexer to translate the concept’s operationalized definition back into a general language definition. Principles of question design make it mandatory that all respondents attribute the same meaning to the question/variable, but that meaning does not have to be the measured concept which indexing aims to capture. For the variable ETHNICITY, discussed above, RELIGION is an indicator. Religion questions are harmonized to ensure that the meaning is not in any way ambiguous and that the measurement is standard. However, the respondent may not know that data is being collected on ETHNICITY. If the measurement is a subjective one, that is, if the respondent is asked what ethnic group they belong to, then the straightforward HASSET term ETHNIC GROUPS will apply (see the ONS guide Ethnic Group Statistics: A Guide for the Collection and Classification of Ethnicity Data).

However it should be kept in mind that these associations do rely on the availability of suitable terms within the thesaurus that follow the rules of word combination. There are word combinations that should not be split in order to preserve meaning (de Keyser, 2012: 20).

5. Conclusion

The level of specificity of gold standard indexing for our four corpora reflected the type of information object a user may wish to retrieve. The Case Studies and Support Guides Corpus is a relatively small collection of documents and requires a few broad terms to ensure retrieval. Other corpora require the application of a greater degree of specificity in the keywords chosen as well as the use of associative terms, which may or may not be visible depending on the size of the text file.

Nesstar indexing was undertaken to locate variables, which in the majority of cases match questions or parts of questions and sometimes sections of the questionnaire. They are complex in design and difficult to process individually. SQB indexing on the other hand was undertaken to locate questions. Both corpora require a considerable level of specificity in the selection of keywords to cover concepts of some complexity that will at times exhaust the availability of terms within a thesaurus hierarchy and perhaps require it to be extended.

We expect to release the evaluation results in the next few weeks.

References

Broughton, V. (2004) Controlled Indexing Languages. In Essential Classification. London, Facet.

Broughton, V. (2004) Faceted Classification. In Essential Classification. London, Facet.

Broughton, V. (2006) Essential Thesaurus Construction. London, Facet.

Bulmer, M. ed. (2010) Social Measurement through Social Surveys: an applied approach. Farnham, Ashgate.

De Keyser, P. (2012) Indexing: From Thesauri to the Semantic Web. Oxford, Chandos.

Hyman, L., Lamb, J. and Bulmer, M. (2006) The Use of Pre-Existing Survey Questions: Implications for Data Quality. Proceedings of European Conference of Quality in Survey Statistics, Rome. Retrieved on 05/01/2013.

Kneeshaw, J. (2011) The UK’s Survey Question Bank: Present and Future Developments. Paris, Question Database Workshop, Reseau Quetelet.

Lancaster, F. W. (1998) 2nd ed. Quality of Indexing. In Indexing and Abstracting in Theory and Practice. Champaign, Illinois, University of Illinois.

Office for National Statistics (2003) Ethnic Group Statistics: A Guide for the Collection and Classification of Ethnicity Data. Retrieved from http://www.ons.gov.uk on 05/01/2013.


2012 review, and a look forward to 2013

As the year draws to an end, it feels right and proper to reflect on what we have achieved so far in the SKOS-HASSET project.

The project began in June and, since then, it has done the following:

We have more work to do, which will take us into the Spring of 2013.  Forthcoming tasks include:

  • Finishing the manual evaluation of the automatic indexing exemplar
  • Writing the automatic indexing recommendations report
  • Applying version control to the thesauri
  • Releasing a new version of the thesauri
  • Releasing SKOS-HASSET on the web, via Pubby, using the recommendations of the licensing report
  • Running a webinar on the use of the thesaurus
  • Creating a leaflet to encourage wider use of the thesaurus

We are looking forward to working more with SKOS and HASSET in the New Year.  In the meantime, the SKOS-HASSET project team would like to wish all its blog readers a very Happy Christmas!

Lucy Bell, UK Data Archive


From Tuples to Triples: applying SKOS to HASSET – a technical overview

Darren Bell

1. Introduction

The UK Data Archive (the Archive) has been experimenting with RDF for a couple of years. When the funding was secured from JISC to apply SKOS to HASSET (Humanities and Social Science Electronic Thesaurus), it was a welcome opportunity for the development team to create a real-life production instance of an RDF dataset that was of a manageable size and which was relatively static.  This post gives a brief overview of some of the technologies that the Archive has deployed to deliver a SKOS-based thesaurus.

2. Applying SKOS to the existing HASSET Thesaurus

HASSET is currently stored as relational data in Microsoft SQL Server.  The challenge for us was to ‘translate’ the existing relationships as defined in traditional rows and columns into RDF triples.

Each term in HASSET has an explicit relationship type (coded as an integer) to another term.  Happily, these relationship types map closely onto the main SKOS predicates:

| HASSET Relationship Between x and y | Proprietary Code | Related SKOS Predicates for each skos:Concept |
|---|---|---|
| Is a Broader Term For | 5 | skos:broader |
| Is a Narrower Term Of | 6 | skos:narrower |
| Is a Synonym Of | 4 | skos:altLabel |
| Should Be Used For | 2 | skos:prefLabel |
| Is a Related Term Of/For | 8 | skos:related |
| Is a Top Term For | 7 | skos:topConceptOf & skos:hasTopConcept |

Each SKOS Concept also has the predicate skos:inScheme, which simply states that the SKOS Concept is in the SKOSHASSET Thesaurus.
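To make the mapping concrete, the following Python sketch turns a single relationship row into N-Triples lines. It is an illustration only and not the Archive's own generator. The namespace, the scheme URI, the concept identifiers and the choice of which term in a row becomes the subject of the triple are all assumptions made for the example; only the code-to-predicate mapping comes from the table above.

```python
from __future__ import annotations

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/skoshasset/"   # placeholder namespace (assumption)
SCHEME = BASE + "scheme"                  # placeholder concept scheme URI (assumption)

# HASSET relationship code -> SKOS predicate local names (from the table above).
CODE_TO_PREDICATES = {
    5: ("broader",),
    6: ("narrower",),
    4: ("altLabel",),
    2: ("prefLabel",),
    8: ("related",),
    7: ("topConceptOf", "hasTopConcept"),
}
LABEL_PREDICATES = {"altLabel", "prefLabel"}


def uri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str, lang: str = "en") -> str:
    """Format a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def triples_for_row(concept_id: str, code: int, target: str) -> list[str]:
    """Return the N-Triples lines for one (concept, code, target) row.

    Assumption: `concept_id` is the subject. `target` is a label for the
    label predicates, another concept id for broader/narrower/related,
    and is ignored for top terms, which are linked to the scheme instead.
    """
    concept = uri(BASE + concept_id)
    lines = []
    for name in CODE_TO_PREDICATES[code]:
        predicate = uri(SKOS + name)
        if name in LABEL_PREDICATES:
            lines.append(f"{concept} {predicate} {literal(target)} .")
        elif name == "topConceptOf":
            lines.append(f"{concept} {predicate} {uri(SCHEME)} .")
        elif name == "hasTopConcept":
            lines.append(f"{uri(SCHEME)} {predicate} {concept} .")
        else:
            lines.append(f"{concept} {predicate} {uri(BASE + target)} .")
    return lines


def in_scheme(concept_id: str) -> str:
    """Return the skos:inScheme triple that every concept carries."""
    return f"{uri(BASE + concept_id)} {uri(SKOS + 'inScheme')} {uri(SCHEME)} ."


# Hypothetical rows: c1 and c2 are made-up concept identifiers.
for line in triples_for_row("c1", 5, "c2") + triples_for_row("c1", 7, ""):
    print(line)
print(in_scheme("c1"))
```

Keeping the mapping in a single dictionary means that a new relationship code only needs one new entry, rather than a change to the row-handling logic.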

3. Creating and Serializing RDF data

Once we understood the constituent SKOS parts of our desired RDF triples, we wrote a “SkosHassetGenerator” class to iterate through each table row, examine the relationship type and generate the appropriate SKOS triple.  The UK Data Archive is primarily a .NET organisation and we referenced a number of C# libraries from http://www.dotnetrdf.org/, which is open source and in turn uses JSON.Net for JSON serialization.  The dotNetRDF libraries are well-documented and this was helpful in generating several serialisations of SKOSHASSET, namely RDF/XML, RDF/JSON, Turtle (which seems to be increasingly popular), NTriples and CSV.

A fundamental part of RDF is a persistent, dereferenceable URI for each SKOS Concept.  In testing we used a local, temporary URI.  We are planning to move to a humanly-meaningful and logical URI on release, however, which will be a subdomain of data-archive.ac.uk.  This is currently being set up and we expect it to be lod.data-archive.ac.uk/skoshasset/<GUID>.  More information on the precise URI will follow.
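As a small illustration of the planned URI pattern, the Python sketch below builds a concept URI from a GUID. The host and path are the expected values quoted above and may change; the "http://" scheme and the helper name concept_uri are assumptions made for the example.

```python
import uuid

# Expected base from the text above; the "http://" scheme is an assumption.
BASE_URI = "http://lod.data-archive.ac.uk/skoshasset/"


def concept_uri(guid: str) -> str:
    """Build a concept URI from a GUID string.

    uuid.UUID validates the input (raising ValueError if it is malformed)
    and str() normalises it to lower-case, hyphenated form, so the same
    concept always yields the same URI.
    """
    canonical = str(uuid.UUID(guid))
    return BASE_URI + canonical


# Made-up GUID, written the way SQL Server tools often display one.
print(concept_uri("{12345678-1234-5678-1234-56781234ABCD}"))
```

Normalising the GUID before building the URI avoids minting two different identifiers for one concept simply because of letter case or surrounding braces.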

The “SkosHassetGenerator” class is run as a scheduled console application from our Jenkins server on a daily basis.  As well as generating physical text files on a network share (which is useful both as a snapshot-archiving mechanism and for physical download to end-users), the class writes the triples into a dedicated Triple Store (see below).

We have identified several benefits to our application of SKOS.  Not only does it make the terms easier to maintain and manipulate, but it has also meant that the thesaurus can be more thoroughly validated by 3rd party online tools. One particular favourite of ours is PoolParty.

4. Persisting RDF data

Having generated a SKOS version of the HASSET thesaurus as RDF text files, the next stage was to be able to persist these data in a Triple Store which would allow querying of the data via SPARQL.  As we are primarily a .NET house, we selected BrightStarDB.  This is open source and is itself built on the same dotNetRDF classes we used to generate the triples in the first place.  BrightStarDB also allows us to easily configure a SPARQL endpoint on IIS7 and provides support for Microsoft Entity Framework, which is the Object Relational Mapper we normally use to connect our web services infrastructure to back-end databases.
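To give a flavour of what querying via SPARQL looks like, the Python sketch below builds a query for the narrower concepts of a given concept and encodes it as a GET request URL, following the "query" parameter convention of the SPARQL Protocol. The endpoint address and the concept URI are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint address


def narrower_concepts_query(concept_uri: str) -> str:
    """Return a SPARQL query listing narrower concepts and their labels."""
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?narrower ?label WHERE {\n"
        f"  <{concept_uri}> skos:narrower ?narrower .\n"
        "  ?narrower skos:prefLabel ?label .\n"
        "}\n"
        "ORDER BY ?label"
    )


def query_url(endpoint: str, query: str) -> str:
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


query = narrower_concepts_query("http://example.org/skoshasset/c1")
print(query)
print(query_url(ENDPOINT, query))
```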

5. From RDF data to Linked Data

Following on from populating the Triple Store and establishing a SPARQL endpoint, the next stage was to make the SKOS Concept URIs  publicly dereferenceable and useful to the wider user community.  This initially presented us with a headache.  We have approximately 7000 unique terms (or SKOS Concepts) in HASSET.  How do you maintain 7000 persistent identifiers on a web server and deliver both HTML and RDF content to users and machines?  Fortunately, following some research, we identified another open source product, called Pubby, based on Java components, principally Tomcat, Jena and Velocity.  With minor configuration and stylesheet changes, this enabled us to set up a web server to point to our SPARQL endpoint and deliver HTML or RDF content as requested by the end user or machine.
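The idea of serving HTML or RDF from the same URI, depending on what the client asks for, is known as content negotiation. The Python sketch below shows the principle using the HTTP Accept header. It is a simplified illustration and not Pubby's implementation: the list of supported media types is an assumption, and wildcards such as "*/*" are ignored and fall back to HTML.

```python
from __future__ import annotations

# Media types this sketch can serve (assumption, for illustration only).
SUPPORTED = ("text/html", "application/rdf+xml")


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media type, quality) pairs."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # HTTP default when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        entries.append((media_type, quality))
    return entries


def choose_representation(accept_header: str, default: str = "text/html") -> str:
    """Pick the supported media type with the highest quality value."""
    best_type, best_quality = default, 0.0
    for media_type, quality in parse_accept(accept_header):
        if media_type in SUPPORTED and quality > best_quality:
            best_type, best_quality = media_type, quality
    return best_type


# A browser-style header gets HTML; an RDF client gets RDF/XML.
print(choose_representation("text/html,application/xhtml+xml;q=0.9"))
print(choose_representation("application/rdf+xml"))
```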

Most of the work has now been completed in terms of applying SKOS to our HASSET Thesaurus.  All that remains is to make the SKOSHASSET SPARQL endpoint and Pubby publicly available for testing.  This will be completed by the end of the project.

Fig 1.  Schematic for generating SKOS Linked Data from HASSET Thesaurus



Licensing for SKOS-HASSET: WP4 deliverable – the SKOS-HASSET Licence Recommendation Report

1.0 Background

The Archive maintains two thesauri: the first, the Humanities and Social Science Electronic Thesaurus (HASSET), is owned entirely by the University of Essex and contains subject terms covering all the social science disciplines; the second, the European Language Social Science Thesaurus (ELSST) takes the core, internationally-applicable terms from HASSET and translates them into a number of European languages. The University of Essex owns the Intellectual Property (IP) in some of this product, but not all.

HASSET, in non-SKOS form (usually as a .csv or a PDF) has been made available for not-for-profit use for many years; however, access to the full set of terms and their relationships has always been granted only after a licence has been signed by the recipient and returned to the UK Data Archive, University of Essex. Indeed, a new licence template was developed in 2011-2012, based initially on the JISC Model Proforma, to ensure that the rights of the University were being adequately protected. It is important to note that in the past this licence has always referred to the intellectual, creative content of the thesauri – the hierarchies and their relationships – and not the database or syntactical structure. A licence has always been applied to the thesauri in the past in order to protect a) the integrity of the terms and b) the quality of the translations. The current licence has a dual purpose: it covers both the use of the thesaurus as an indexing tool as well as regulating its translation into further languages.

Additionally and importantly, the UK Data Archive would like to expand further the membership of the HASSET/ELSST user community (already enhanced during the SKOS-HASSET project). The release of the SKOS product provides an ideal opportunity to further the work of the Archive in communicating with, and learning from, its thesaurus users. To do this, however, the Archive would need to know who its thesaurus users were.

Because of the expected need to maintain the quality and integrity of the thesaurus, this Licensing Report reviews and makes recommendations for the licence conditions under which the SKOS-enabled thesaurus product (SKOS-HASSET) may be delivered.

1.1 Previous IP work

Investigative work covering the Intellectual Property Rights (IPR) and associated licences in use in relation to HASSET and ELSST took place in 2011. In summary, this work found that:

1. the IPR in HASSET are owned by the University of Essex, but the IPR in parts of ELSST are owned by third parties;
2. the licensing system in relation to ELSST is complicated by the need to make provision for quality translations;
3. some users of ELSST, and indeed some of the ELSST IPR owners, are outside UK academia.

The University of Essex owns all of the Intellectual Property (IP) in HASSET and a lot of the Intellectual Property in the current version of ELSST. Archive staff undertook some of the original ELSST translation work for the French, German and Spanish translations and created the structure and framework for the thesaurus, which was based on HASSET. It was developed here in the Archive during the LIMBER project (January 2000 – June 2001), but has been developed further since then. As such, the Archive holds the IP in:

• the database structure;
• the thesaurus structure/hierarchies;
• the core terms;
• the English extensions, which include terms specific to British life, government and administration (effectively HASSET);
• the Spanish translations;
• the German translations made up to and including 9 December 2005;
• the French translations made up to and including 30 June 2001.

The remaining IP is held by other, external organisations or individuals. These Intellectual Property Rights relate to:

• the Finnish translations and extensions;
• the Greek translations and extensions;
• the Norwegian translations and extensions;
• the Danish translations and extensions;
• the Swedish translations and extensions;
• the Lithuanian translations and extensions;
• the German translations and extensions to be released in the next version;
• the French translations and extensions made from 1 November 2005.

A brief review of existing licence models was undertaken in May 2011. This review identified other thesaurus products, external to the Archive, with:

• no licences;
• Open Government Licences;
• and more restrictive, detailed licences.

These products were not all SKOS-enabled. All these licence models, as well as the licence template included in the JISC IPR toolkit, were examined.

The JISC model was taken as the starting point in developing a new licence; this licence model ‘contains more favourable provisions than any standard commercial licence for access and use of online resources’ (Korn, 2011). Nonetheless, HASSET and ELSST’s needs were in some respects less complex than those catered for under the JISC proforma and in other respects more complex. The JISC proforma includes provision for institutional responsibilities in relation to resources being made available to staff and students, which a HASSET/ELSST product did not require; on the flipside, any licence covering ELSST and HASSET must protect the integrity and quality of the product not only when it is being used, but also when it is being translated.

As part of this work an initial licence was set up in 2011, based on the JISC exemplum. This has since gone through many iterations. The final licence was approved by the University of Essex’s Research Enterprise Office on 13th March 2012. This is the licence that is currently in use for HASSET and ELSST.

2.0 Copyright issues and thesauri

Copyright law and technology have long been at odds with each other. The Hargreaves report from 2011 includes an entire section on this issue, describing in detail how copyright law has held up technological – and, in some cases, societal – advances. It says:

‘So the question is how to build in sufficient flexibility to realise the benefits of new technologies, without losing the core benefits to creators and to the economy that copyright provides.’ (Hargreaves, 2011)

One of the ways that technologists have tried to address this is through the creation and use of open source systems and the application of Creative Commons licences. This approach works reasonably well; however, there are two key problems in its application to a thesaurus:

1. Creative Commons licences only apply to creative works; they do not cover data or databases (Miller, Styles and Heath, 2008). The question of whether the intellectual property rights in a hierarchy structure would be covered by a Creative Commons licence is an open one.

2. A thesaurus is an authorised set of terms which describe an aspect of the world, society or a discipline. Through their authorisation, they are both descriptive and prescriptive – and this is an important pairing. Thesauri are living, dynamic tools, being updated to reflect changes in the world; however, for these changes to carry authority, they must be made by a single organisation – the thesaurus owner. Each thesaurus should exist in a single, controlled and authorised form. If this does not happen, the integrity of the terms is under question.

Importantly, RDF and SKOS are simply one type of data format that may be applied to a file. Méndez and Greenberg (2012) describe ‘linked open vocabularies as a part of the new knowledge organization ecosystem’. They go on to explain that research has shown that the ‘subject’ or conceptual-type search is the most common type of search on the web. This seems to raise the need for both inter-linked thesauri and controlled vocabularies which try to describe the world, but also for high quality thesauri and controlled vocabularies. Using Linked Data formats is entirely appropriate here; however, in order to maintain the quality of the product, these formats do not necessarily have to be Linked Open Data. Sanchez, Mendez and Rodríguez-Muñoz (2009) explain how the use of SKOS enhances both the user’s and the provider’s experiences: ‘From the user perspective, the use of thesauri developed with the SKOS model affects … those who use them through query operations. For information managers, SKOS offers a closer approach to knowledge organization and management, complementing the automatic extraction of textual content from documents with its indexing through conceptual entities’. There is little doubt that SKOS is the ideal format for index terms both to index conceptually and automatically. The question is how to do this while still maintaining quality control?

A key, recent work on the applicability of various conditions sets to academic tools is the review of the openness of licences undertaken by the Naomi Korn Copyright Agency on behalf of the JISC (Korn, 2011). This review analyses the various licences available to data or resource producers, including the JISC Model Licence, Creative Commons licences, the JISC Collections Open Educational User Licence v 1.0, Open Data Commons and the Open Government Licence.

Of these, the most appropriate licences which could be considered for the SKOS-HASSET project would be the JISC ones, Creative Commons and, possibly, the Open Data Commons licence. The Open Government Licence refers primarily to Government information and so, although a contender, may not be ideal. The JISC Model Licence is the de facto type licence in use currently for HASSET and ELSST as it was taken by the University of Essex as the proforma for the existing HASSET and ELSST licence.

Korn’s work outlines two of the key issues which would prevent the use of Creative Commons Licences in relation to SKOS-HASSET; the review states that:

1. Creative Commons Licences may not be suitable ‘where third-party issues are present and require additional clearance.’
(This is definitely the case for ELSST where the majority of the translations are owned by other, non-UK organisations and individuals.)

2. ‘At a strategic level, committing to the irrevocable terms of CC licences raises issues of broader access and commercial goals for organisations.’
(Once set up, it would be difficult to reverse the terms of Creative Commons licences. While the CESSDA ERIC, the eventual legal entity in relation to European data archives, is still being established, flexibility in terms of being able potentially to change licence conditions is essential.)

The Open Data Commons Licence, set up as an open solution to data or databases (rather than creative works), may be a contender in relation to the HASSET/ELSST database structure, as might the GNU General Public License. GNU is a free, copyleft licence for software and other kinds of works. It is intended to guarantee the creators and users of a work the freedom to share and change all versions of it, and to make sure all versions remain free for all. All works derived from a database licensed under GNU must abide by the terms of the original GNU licence.

It is the very inclusion of a database alongside the creative work it holds which makes the use of one or other of these licences problematic, though. The big problem with using either of these licences would be the need to include a second licence which covers the intellectual, creative content – the hierarchies – as neither of these licences covers this sort of information. Releasing the thesaurus products under one of these data/database type licences may result in users taking this to mean that any changes may be made to the terms and re-released under the same terms and conditions. Although the terms of the licences would not permit this, confusion could still ensue. Doing this would also create a multi-licence situation, which would complicate matters, rather than simplify them.

Related to this, Naomi Korn identifies some circumstances in which it may not be appropriate to use open licences. Two of these match the circumstances of HASSET/ELSST; she states that ‘situations where [considering the placing of ‘some’ restrictions upon the user, such as ‘No derivative works’ (‘ND’) and/or ‘Non-commercial’ (‘NC’) restrictions] will need to be made include the following’:

1. inclusion of data and/or databases
(one of the key issues here is that the University of Essex owns the IP in the database and syntactical structure);

2. inclusion of third-party-generated content for which permissions have not been cleared
(again, this is very pertinent in relation to ELSST).

Korn suggests that in these circumstances, one could use a licence with a ‘no derivatives’ attribution, a licence with a ‘no commercial use’ attribution or a licence that restricts certain classes of users from being able to access resources. These options, while entirely suitable for the academic community, still prevent access to, or use of, a variety of resources to third parties outside UK education.

The SKOS-HASSET project team does not wish to restrict access to, or even the re-purposing of, its tools and products on the basis of user types or geography, especially as many related thesauri are owned and developed in non-UK and sometimes non-educational arenas. Online browsing of individual concepts and their relationships should be maintained. In terms of the full set of hierarchies, however, the project team requires the ability to maintain the integrity of its products in their entirety and to be offered any derivations. The product may end up being of interest to a community wider than academia and, in fact, colleagues in the commercial, publishing sector have already expressed an interest in it. Rather than restricting users, surely it would be better to make the thesaurus available to anyone and everyone, but behind a simple bespoke and effective licence, even if this is slightly more restrictive than Creative Commons?

Korn supports this. She states that ‘whilst undoubtedly there are numerous benefits associated with the use of ‘open’ licences and the creation of truly Open Educational Resources, which are repurposable and reusable, there are clearly circumstances … where this is not feasible’. Circumstances in which a) the widest community of users, including potentially those from commerce or overseas would be prevented from gaining access to the product, b) the quality and integrity of the product may be jeopardised and c) IPR are shared among a variety of organisations and individuals are likely to be those within which an open licence may not be the best way forward.

2.1 Product integrity

Naomi Korn suggests that where there is a question mark over the use of an open licence, ‘the priorities of the initial licensor of the content need to be based upon an open vs risk evaluation, rather than openness only’ (Korn, 2011). The SKOS-HASSET project has undertaken just such an open vs risk evaluation:

| Item | Predicted risks with open licence | Predicted advantages of open licence | Benefit score (1-5, 1=least benefit) | Likelihood of risk (1-5, with 1=least likely) | Severity of risk (1-5, with 1=least severe) | Risk score | Adjusted risk score (Risk score – Benefit score) |
|---|---|---|---|---|---|---|---|
| Hierarchies may be changed without licensor’s knowledge | Terms would lose their integrity; relationships may be broken; SKOS may no longer validate; archives/services using HASSET/ELSST for indexing may not be using the authorised version; multiple versions may proliferate; lack of consistency and harmonisation in versions used would undermine existing, and restrict future, efforts to build tools for multisite use (e.g. discovery of similar/comparable resources in other countries) | Product may be used widely; product may be made more appropriate to local needs | 5 | 5 | 5 | 25 | 20 |
| Hierarchies may be added without licensor’s knowledge | Relationships may be broken; SKOS may no longer validate; archives/services using HASSET/ELSST for indexing may not be using the authorised version; multiple versions may proliferate; lack of consistency and harmonisation in versions used would undermine existing, and restrict future, efforts to build tools for multisite use (e.g. discovery of similar/comparable resources in other countries) | Product may be used widely; product may be made more appropriate to local needs | 5 | 5 | 5 | 25 | 20 |
| Derived version of HASSET or ELSST may be released by third party organisation | Terms would lose their integrity; archives/services using HASSET/ELSST for indexing may not be using the authorised version; multiple versions would proliferate; lack of consistency and harmonisation in versions used would undermine existing, and restrict future, efforts to build tools for multisite use (e.g. discovery of similar/comparable resources in other countries) | Product may reach previous non-user communities | 2 | 3 | 5 | 15 | 13 |
| New translations may be made of the core terms | Quality control of translations would not be made centrally; like-for-like translations may be attempted; authorised version would not receive the new translations | Product may be translated into more languages than at present | 4 | 4 | 5 | 20 | 16 |
| Derived version of HASSET or ELSST may be sold by third party organisation | Terms would lose their integrity; archives/services using HASSET/ELSST for indexing may not be using the authorised version; multiple versions would proliferate; lack of consistency and harmonisation in versions used would undermine existing, and restrict future, efforts to build tools for multisite use (e.g. discovery of similar/comparable resources in other countries); academic community would suffer; legal proceedings may follow | Product may reach previous non-user communities | 2 | 1 | 5 | 5 | 3 |

Table 1: Open vs risk evaluation

This led on to a further analysis, weighing each risk against any benefits:

 

| Item | Benefit score (1-5, 1 = least benefit) | Risk score | Adjusted risk score (Risk score – Benefit score) |
|---|---|---|---|
| Hierarchies may be changed without licensor’s knowledge | 5 | 25 | 20 |
| Hierarchies may be added without licensor’s knowledge | 5 | 25 | 20 |
| Derived version of HASSET or ELSST may be released by third party organisation | 2 | 15 | 13 |
| New translations may be made of the core terms | 4 | 20 | 16 |
| Derived version of HASSET or ELSST may be sold by third party organisation | 2 | 5 | 3 |

Table 2: Adjusted risk scores
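
The arithmetic behind the two tables can be reproduced in a few lines. The sketch below is illustrative only: the adjusted score (risk score minus benefit score) follows the heading of Table 2, but the field names `likelihood` and `impact` are our assumed labels for the two factors in Table 1, and the multiplication is inferred from the published figures (for example 3 × 5 = 15) rather than stated in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    item: str
    benefit: int      # benefit score, 1-5 (1 = least benefit)
    likelihood: int   # assumed label for the first factor in Table 1
    impact: int       # assumed label for the second factor in Table 1

    @property
    def risk_score(self) -> int:
        # Inferred from the published figures: risk = factor 1 x factor 2
        return self.likelihood * self.impact

    @property
    def adjusted_risk_score(self) -> int:
        # As defined in the heading of Table 2
        return self.risk_score - self.benefit


# Scores copied from Tables 1 and 2 (item wording abbreviated)
RISKS = [
    Risk("Hierarchies changed without licensor's knowledge", 5, 5, 5),
    Risk("Hierarchies added without licensor's knowledge", 5, 5, 5),
    Risk("Derived version released by third party", 2, 3, 5),
    Risk("New translations made of the core terms", 4, 4, 5),
    Risk("Derived version sold by third party", 2, 1, 5),
]

if __name__ == "__main__":
    # List the items from highest to lowest adjusted risk
    for risk in sorted(RISKS, key=lambda r: r.adjusted_risk_score, reverse=True):
        print(f"{risk.adjusted_risk_score:>3}  {risk.item}")
```

Sorting by the adjusted score puts the two hierarchy-related risks first, which is consistent with the emphasis on retaining control of the terms and their relationships.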

The risks involved with making the thesauri open do not appear to be outweighed by the benefits in this case. The need to retain control of the quality and integrity of the terms, their relationships and the product as a whole is more important than the need to make the product entirely freely available. That is not to say that the product should not be available to all, simply that an element of control, exercised through a licence mechanism, would ensure, in this case, that the product remains of sufficient quality to be useful to all in the future. Any proliferation of derivatives of the thesaurus could, in fact, result in a diluted and less trustworthy product.

3.0 Precedents

A review of the methods of access of existing SKOS products has also been undertaken. The following thesauri have been examined:

• Agrovoc (http://aims.fao.org/standards/agrovoc/about)
• Decimalised Database of Concepts (http://ontologi.es/decimalised/decimalised.rdf)
• Eurovoc (http://eurovoc.europa.eu/drupal/)
• GEMET (http://www.eionet.europa.eu/gemet)
• GeoNames (http://www.geonames.org/)
• IVOAT Thesaurus (http://www.ivoa.net/rdf/Vocabularies/vocabularies-20091007/IVOAT/IVOAT.html)
• Library of Congress Subject Headings (http://id.loc.gov/download/)
• NAL Thesaurus (http://agclass.nal.usda.gov/)
• ONKI portal (http://onki.fi/)
• PICO Thesaurus (http://www.culturaitalia.it/opencms/export/sites/culturaitalia/attachments/thesaurus/4.3/thesaurus_4.3.0.skos.xml)
• RAMEAU (http://rameau.bnf.fr/informations/rameauenbref.htm)
• STW Thesaurus for Economics (http://zbw.eu/stw/versions/latest/about)
• TheSoz (http://www.gesis.org/en/services/research/thesauri-und-klassifikationen/social-science-thesaurus/)

A further list of SKOS-enabled thesauri can be found at http://www.w3.org/2001/sw/wiki/SKOS/Datasets.

| Thesaurus | Coverage | File formats for download | Comments | Licence |
|---|---|---|---|---|
| Agrovoc | The AGROVOC thesaurus contains more than 40 000 concepts in up to 22 languages covering topics related to food, nutrition, agriculture, fisheries, forestry, environment and other related domains. | Supported: RDF/XML; N-triples; Web Services. Unsupported: SKOS RDF/XML (English only); MySQL; Protégé DB; OWL | Re-authentication takes place using email addresses only (once registered, users need only enter their email address again to gain access a second time). | Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License |
| Decimalised Database of Concepts | The Decimalised Database of Concepts is a collection of topics suitable for use in linked data. It is inspired by the Dewey Decimal Classification, but no guarantees are made about the closeness of its resemblance as a whole. SKOS mapping links are provided from this database to the Dewey system, to Library of Congress Classification codes and to DBpedia resources where possible. | XHTML+RDFa 1.0; RDF/XML; N-triples | | Freely available |
| Eurovoc | EuroVoc is a multilingual, multidisciplinary thesaurus covering the activities of the EU and the European Parliament in particular. It is managed by the EU Publications Office. It contains terms in 22 EU languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish), plus Croatian and Serbian. | SKOS/RDF; XML | Licence acceptance takes place offline. Users must email a dedicated mailbox, asking for access to Eurovoc. Once granted, users are sent a PDF copy of the licence, plus a username and login. | EU-specific licence, which must be accepted offline. |
| GEMET | GEneral European Multilingual Environment Thesaurus. GEMET has been developed as an indexing, retrieval and control tool for the European Topic Centre on Catalogue of Data Sources (ETC/CDS) and the European Environment Agency (EEA), Copenhagen. | HTML for import into MS-Access; RDF (themes and groups relationships); SKOS/RDF (broader/narrower relations) | | Freely available |
| GeoNames | The GeoNames geographical database covers all countries and contains over eight million placenames that are available for download free of charge. | Free of charge: individual Gazetteer files (one per place); some free post code files. Costed services also exist. | | Creative Commons Attribution 3.0 License |
| IVOAT Thesaurus | Astronomical terms. | RDF; Turtle | | Freely available |
| Library of Congress Subject Headings | Library of Congress Subject Headings (LCSH) has been actively maintained since 1898 to catalogue materials held at the Library of Congress. LCSH in this service includes all Library of Congress Subject Headings, free-floating subdivisions (topical and form), Genre/Form headings, Children’s (AC) headings, and validation strings for which authority records have been created. The content includes a few name headings (personal and corporate), such as William Shakespeare, Jesus Christ, and Harvard University, and geographic headings that are added to LCSH as they are needed to establish subdivisions, provide a pattern for subdivision practice, or provide reference structure for other terms. | RDF/XML; Turtle; N-triples | | Freely available |
| NAL Thesaurus | The USA’s National Agriculture Library (NAL) thesaurus and glossary are online vocabulary tools of agricultural terms in English and Spanish and are cooperatively produced by the NAL, US Department of Agriculture, and the Inter-American Institute for Cooperation on Agriculture through the Orton Memorial Library, the Mexican Network of Agricultural Libraries (REMBA), as well as other Latin American agricultural institutions belonging to the Agriculture Information and Documentation Service of the Americas (SIDALC). | XML; RDF-SKOS; PDF; MARC; DOC | User must click to accept the Usage conditions. | NAL-specific terms and conditions of use |
| ONKI portal | The ONKI service contains Finnish and international ontologies, vocabularies and thesauri needed for publishing content on the Semantic Web. | OWL; RDF/XML; Turtle; SKOS-RDF; RDFS | ONKI provides access to many different vocabularies and ontologies. | Files either freely available or behind a Creative Commons 3.0 License |
| PICO Thesaurus | The Dictionary of Italian Culture is a controlled vocabulary designed for subject indexing and classification of heterogeneous resources, sourced from different cultural contexts. | XML | The XML has been put through an RDF to HTML stylesheet so is both human-readable on screen and may be saved as RDF. Nothing needs to be agreed before viewing or saving becomes possible. | Creative Commons 2.5 License (Italian) |
| RAMEAU | RAMEAU consists of a vocabulary of interconnected terms and an indicative syntax for the construction of subject headings. It includes a set of authority records (common names and geographical entities). RAMEAU is enriched progressively through proposals from its network of users. | None | The web site states: ‘Fournitures de fichiers de données brutes: Des fichiers de données brutes peuvent être fournis, portant : soit sur des produits courants : les notices RAMEAU (telles que définies dans Périmètre des autorités RAMEAU), créées, modifiées (à l’exception des modifications induites par des liens dans d’autres notices) et annulées pendant la période considérée ; soit sur des produits rétrospectifs : l’ensemble des notices RAMEAU (telles que définies dans Périmètre des autorités RAMEAU) à une date donnée.’ (= ‘Availability of raw data files: raw data files can be supplied that relate either to current products: RAMEAU records (as defined in ‘Scope of RAMEAU authority records’), created, modified (except for changes caused by links in other records) and cancelled during the period in question; or to retrospective products: all RAMEAU records (as defined in ‘Scope of RAMEAU authority records’) at a given date.’)[1] The user is directed to the Produits et services bibliographiques page of the Bibliothèque nationale de France (BNF), which provides a further link to Produits bibliographiques. This last page includes a price list. Autorités RAMEAU, for example, costs between €1,000 and €1,196 per installation. | BNF-specific licence, which must be accepted offline. |
| STW Thesaurus for Economics | This thesaurus relates primarily to economics; it contains more than 6,000 standardized subject headings and about 19,000 entry terms to support individual keywords. It also contains technical terms used in law, sociology, or politics, as well as geographic entities. | RDF/XML; Turtle; N-triples | An optional email list is available for announcements relating to the thesaurus, but there is no need to sign up prior to download. | Creative Commons Attribution-Noncommercial-Share Alike 3.0 Germany License |
| TheSoz | The Thesaurus for the Social Sciences (Thesaurus Sozialwissenschaften) is used for keyword searching in SOFIS (Social Science Research Information System) and SOLIS (Social Science Literature Information System). The list of keywords contains about 12,000 entries, of which more than 8,000 are descriptors (authorised keywords) and about 4,000 non-descriptors. Topics in all of the social science disciplines are included. It has been translated into French and English. | SKOS-XL; RDF/XML; Turtle | Users must complete an online form prior to download. This captures name, organisation and email address. Nothing needs to be agreed in advance. As well as the downloadable data, TheSoz is available in a Linked Data HTML representation. A SPARQL endpoint (using Pubby) is also available as a technical interface. | Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Germany License |

[1] Translation provided by Dr Lorna Balkan, 17 October 2012.

Table 3: Licence arrangements for other thesauri

Of the twelve individual thesauri included above, five (42%) have Creative Commons licences, four (33%) are freely available and three (25%) maintain their own, bespoke licensing arrangements. (The remaining entry, the ONKI portal, is not counted in these figures because it provides access to thesauri and other tools that are either freely available or protected by Creative Commons licences.) The licensing landscape is clearly varied, with several different licence models in operation.
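
For readers who wish to check these proportions, the tally can be reproduced as follows. This is a minimal sketch: the three category labels are our shorthand for the Licence column of Table 3, and the ONKI portal is left out because it spans two categories.

```python
from collections import Counter

# Licence model of each individual thesaurus, taken from Table 3.
# The ONKI portal is omitted: its files are either freely available
# or under Creative Commons, so it does not fit a single category.
LICENCE_MODEL = {
    "Agrovoc": "Creative Commons",
    "Decimalised Database of Concepts": "Freely available",
    "Eurovoc": "Bespoke",
    "GEMET": "Freely available",
    "GeoNames": "Creative Commons",
    "IVOAT Thesaurus": "Freely available",
    "Library of Congress Subject Headings": "Freely available",
    "NAL Thesaurus": "Bespoke",
    "PICO Thesaurus": "Creative Commons",
    "RAMEAU": "Bespoke",
    "STW Thesaurus for Economics": "Creative Commons",
    "TheSoz": "Creative Commons",
}

counts = Counter(LICENCE_MODEL.values())
total = sum(counts.values())

# Percentage share of each licence model, rounded to whole numbers
shares = {model: round(100 * n / total) for model, n in counts.items()}

if __name__ == "__main__":
    for model, n in counts.most_common():
        print(f"{model}: {n} of {total} ({shares[model]}%)")
```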

Recommendations

Following the brief review of existing work in this field, of risks and precedents and of the particular needs of a HASSET/ELSST product, this report makes the following recommendations:

• a bespoke licence will be used, in preference to Creative Commons licences, which a) are too broad, b) cannot be reversed and c) are difficult to apply in a multi-IP situation;

• the existing licence in use for HASSET will continue to be used; this has the following advantages:

    • it is a single licence which can be adapted for both HASSET and ELSST and any combinations of their shared terms;

    • the licence does not currently prevent profit-making, non-UK or non-academic usage, but, through the offline completion of its custom fields, has the flexibility to allow all users to gain access to the product;

    • it does not permit the thesaurus to be passed on to third parties;

    • it already includes the requirement that all translations must be made against the British English core terms from the Humanities And Social Science Electronic Thesaurus (HASSET), and plans are already in place, via additional resources, for the management of such translations;

• each concept, encoded in RDF, will be browseable via Pubby, which will also contain copyright information in its Dublin Core metadata and log accesses via Google Analytics (capturing domain names/IP addresses);

• each term and hierarchy will be browseable via the human-readable web pages, which will also contain copyright information, both in their page metadata and displayed on screen, and will log accesses via Google Analytics (capturing domain names/IP addresses);

• non-profit-making users will be able to download the full thesaurus online by joining the HASSET/ELSST community (involving simple authentication and agreeing to our terms and conditions);

• profit-making users and those wishing to make new translations will be able to gain access to the product through the offline completion of the licence;

• the system to be developed for non-profit-making users is envisaged as an amalgamation of the NAL and Agrovoc systems, via the collection of the potential user’s details, including:

    • name;
    • email address;
    • position;
    • organisation;
    • the use to be made of the thesaurus;
    • date;

  and the pre-population of the custom fields of the licence with the information supplied by the user; the user will also be required to click to show that they have agreed to the licence’s terms and conditions;

• the licence will run from the date of acceptance for twelve months, or for the part thereof remaining until the date of the next release; licence renewals will be generated via an alerting service (to be established post-project) and will be tied in with the date of annual release of new versions of the thesaurus;

• licence-holders will become members of the HASSET/ELSST community and will be invited to share their uses of the thesauri and to comment on their development in a variety of ways (for instance, via workshops or an annual conference if these are viable, and via the blog);

• simple and appropriate authentication should be employed to verify the user’s email address (e.g. via an automated message, containing a link which must be activated and/or using Shibboleth or similar); the precise form of authentication is yet to be decided.
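
To make the proposed online mechanism more concrete, the sketch below shows one way in which the details collected from a non-profit-making user could pre-populate the licence’s custom fields, and how the licence period could be calculated. It is an illustration under stated assumptions, not the system that will be built: the class, function and template names are hypothetical, the licence wording is a placeholder, and `licence_end` encodes only one reading of “twelve months or part thereof to the date of next release”.

```python
from dataclasses import dataclass, asdict
from datetime import date
from string import Template


@dataclass(frozen=True)
class LicenceRequest:
    """Details collected from a prospective non-profit-making user."""
    name: str
    email: str
    position: str
    organisation: str
    purpose: str           # the use to be made of the thesaurus
    accepted_on: date
    agreed_to_terms: bool  # set when the user clicks to accept


def one_year_after(day: date) -> date:
    """Return the same calendar day one year later."""
    try:
        return day.replace(year=day.year + 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year
        return day.replace(year=day.year + 1, day=28)


def licence_end(accepted_on: date, next_release: date) -> date:
    """Assumed rule: the licence ends at the next release if that falls
    within twelve months of acceptance, and after twelve months otherwise."""
    latest = one_year_after(accepted_on)
    if accepted_on < next_release <= latest:
        return next_release
    return latest


# Placeholder wording; the real licence text is not reproduced here.
LICENCE_FIELDS = Template(
    "Licensee: $name ($position, $organisation)\n"
    "Contact: $email\n"
    "Purpose: $purpose\n"
    "Accepted on: $accepted_on"
)


def prefill_licence(request: LicenceRequest) -> str:
    """Fill the licence's custom fields from the user's details."""
    if not request.agreed_to_terms:
        raise ValueError("the terms and conditions must be accepted first")
    return LICENCE_FIELDS.substitute(asdict(request))
```

Email verification (for example, an activation link or Shibboleth) is deliberately left out of the sketch, since the precise form of authentication is yet to be decided.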

These recommendations, once implemented, should allow for a simple, yet regulated system, which is a compromise between an entirely open system and one which only permits use of the thesaurus after a signature has been received. It employs a degree of trust in relation to users’ acceptance of the not-for-profit terms and conditions, but also allows the University to maintain the quality and integrity of the product and to track its thesaurus users and their usage of its product. Finally, and importantly, it gives licence-holders the ability both to browse the thesauri freely online and to become part of a wider community of thesaurus users.

References

Blumauer, A. (2010) ‘Why SKOS thesauri matter – the next generation of semantic technologies’. Semantic Web Company blog post, 31 August 2010 [http://blog.semantic-web.at/2010/08/31/why-skos-thesauri-matter-the-next-generation-of-semantic-technologies/]

Hargreaves, I. (2011) Digital opportunity: a review of intellectual property and growth. [Newport, South Wales, UK Intellectual Property Office]

Korn, N. (2011) Overview of the ‘Openness’ of licences to provide access to materials, data, databases and media. JISC/Naomi Korn Copyright Consultancy, January 2011.

Méndez, E. and Greenberg, J. (2012) Linked Data for Open Vocabularies and Hive’s Global Framework, El Profesional de la Información, May/Jun 2012, 21 (3), pp. 236-244.

Miller, I., Styles, R. and Heath, T. (2008) ‘Open data commons : a licence for open data’, LDOW2008, 22 April 2008, Beijing, China. [http://events.linkeddata.org/ldow2008/papers/08-miller-styles-open-data-commons.pdf]

Pastor-Sanchez, J., Mendez, F. J. M. and Rodríguez-Muñoz, J. V. (2009) ‘Advantages of thesaurus representation using the Simple Knowledge Organization System (SKOS) compared with proposed alternatives’, Information Research, 14 (4), paper 422.

Posted in Access, Licensing, Project Management

Language and Computation Day 2012, University of Essex

Lucy Bell and Mahmoud El-Haj, members of the SKOS-HASSET project team, recently presented the work of the project at the University of Essex’s 2012 Language and Computation Day (held on 4 October 2012).  The University’s Language and Computation Group is an inter-disciplinary research group, containing members drawn from a number of departments, including Language and Linguistics, Computer Science and Electronic Engineering as well as the UK Data Archive.  It organises inter-disciplinary meetings, as well as inviting external speakers to the University.

Lucy presented an overview of the work of the project: SKOS-HASSET: a project at the UK Data Archive. The presentation described the headline objectives of applying SKOS to HASSET, testing this via automatic indexing, investigating licensing and improving the user interfaces.  Lucy also gave a summary of the progress that the project has made.  We are now halfway through the contract and have achieved the following so far:

  • SKOS has been applied to HASSET
  • A system for the re-application of SKOS to other hierarchies has been established internally
  • The texts have been prepared for the automated indexing case study and two corpora (catalogue records and SQB questionnaires) have already been automatically indexed
  • Manual indexing of questions, to create the gold standard, is under way (almost 21,000 questions have been indexed so far)
  • An evaluation timetable, incorporating both automatic and manual evaluation of the automatic indexing results, has been drawn up
  • The research on licences is well under way and the licensing report is expected soon
  • Initial requirements for user and management interfaces have been drafted
  • The project has been promoted via this blog and at conferences (see earlier blog posts!)

Mahmoud presented a thorough review of the data mining work undertaken in Work Package 2: Keyword Indexing with a SKOS Version of HASSET Thesaurus.  Mahmoud’s presentation described the work to apply automatic indexing to the four corpora which we are targeting: catalogue records; SQB questionnaires; full-text case studies and support guides; and questions/variables taken from Nesstar.

Many interesting questions were fielded, and suggestions given about how to extend work in this area, post-project.  A short debate was held about whether it would be possible to apply automatic indexing to the data within the Archive collections; however, it was concluded that questions of copyright, disclosure and data protection would not permit this.  Good ideas were also received from colleagues in other faculties regarding ways of extending HASSET via suggestions of new terms provided via automatic indexing.  These will be further examined as part of the sustainability work of the project.

Posted in Communication, Project Management, Text mining