SKOS-HASSET webinar, broadcast 28 March 2013

A webinar describing the work of the SKOS-HASSET project and showcasing its results was broadcast on Thursday 28 March 2013.  The webinar was recorded and the recording, along with the static slides, has been published on the UK Data Archive’s SKOS-HASSET pages.  Please do take a look.

The SKOS-HASSET project has now finished.  Thank you to everyone who contributed to it.

Lucy Bell

SKOS-HASSET Project Manager

Posted in Uncategorized

Technical objectives and deliverables

Lucy Bell

Introduction

The SKOS-HASSET project had several technical objectives and deliverables:

  1. to create SKOS-HASSET by applying RDF to an existing, well-respected and well-used thesaurus (HASSET)
  2. to bring HASSET and ELSST into a single framework at database level
  3. to improve and update HASSET’s online user-facing webpages, hosting SKOS-HASSET and using open source technologies wherever possible
  4. to extend ELSST’s online management interface (http://elsst.esds.ac.uk/) to facilitate the release of new versions of the thesaurus products

Following agreement from the JISC in October 2012, the second and fourth of these original technical objectives were refined and amended: two wide-ranging objectives became three more specific ones.  This was done in response to a changed requirement landscape and to pave the way for further, more in-depth development work, for which we have received additional funding.

Rather than bringing both HASSET and ELSST together on a single platform and tweaking the ELSST management interface, it was agreed to:

  1. test the alignment of the HASSET and ELSST hierarchies by injecting HASSET terms into ELSST and testing that the combined hierarchies work
  2. establish a version control system for new releases
  3. release a new version of ELSST, testing the mechanism

These actions will provide us with a good, solid base on which we can entirely re-imagine the management interface and underlying data, rather than tweaking an existing system.

SKOS-HASSET

SKOS-HASSET was created, validated and released online on 26 February 2013.  As documented in a previous blog post by Darren Bell, we used Pubby as the publication tool; that post describes the work undertaken to achieve this objective.

The product is available as genericode, Turtle and RDF.
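
For readers who want to explore the released file programmatically, the short Python sketch below loads a locally downloaded copy of the Turtle serialisation with rdflib and lists a few concepts.  It is an illustration only: the filename is hypothetical and this is not part of the Archive's own tooling.

from rdflib import Graph
from rdflib.namespace import RDF, SKOS

g = Graph()
g.parse("skos-hasset.ttl", format="turtle")   # hypothetical local copy of the published Turtle file

concepts = list(g.subjects(RDF.type, SKOS.Concept))
print(len(concepts), "skos:Concept resources found")
for concept in concepts[:5]:
    for label in g.objects(concept, SKOS.prefLabel):
        print(concept, label)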

Web pages

The HASSET web pages have been extended and enhanced with new, SKOS-related information and a browsable version of the thesaurus.  This HASSET browser (in beta at present) obtains its data from WCF REST services, which supply JSON objects built from the relevant database queries.  Select boxes are used to generate the human-browsable structure for HASSET.  Initially, a proof of concept was set up using ASP.NET Web Forms; this was then developed further to enhance the user experience by allowing searches within the terms, while also protecting the Archive’s intellectual property.  These new and updated pages were released on 27 March 2013.  Feedback from users is welcomed.
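
As an illustration of the pattern described above (and not the Archive's actual service), the following Python sketch fetches a JSON list of terms from a hypothetical REST endpoint, much as the browser's select boxes do; the URL and the response shape are assumptions.

import requests

BASE_URL = "https://example.org/hasset/api"   # hypothetical endpoint, not the real service

def narrower_terms(term_id):
    # Ask the REST service for the narrower terms of a concept and return the parsed JSON
    response = requests.get(f"{BASE_URL}/terms/{term_id}/narrower", timeout=10)
    response.raise_for_status()
    return response.json()   # assumed shape: a list of {"id": ..., "label": ...} objects

for term in narrower_terms("employment"):
    print(term["id"], term["label"])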

An online licence form for requests to download and use the entire thesaurus is also being developed.  We expect to release this within 2013.

Alignment of hierarchies

Information development and technical development work combined to achieve this objective.

Our project officers, Lorna Balkan and Suzanne Barbalet, compared all the hierarchies within HASSET and ELSST.  Those which differ were thoroughly investigated, with all the history and log files consulted and the extent of the issues identified.  The following results were found:

  • terms that are in HASSET but not in ELSST:
    the majority of these will remain but will not be deemed to be ‘core’ terms;
    those considered to have international applicability have been added to the ELSST comments file for discussion with CESSDA colleagues
  • terms that are in ELSST but not in HASSET:
    these were more crucial as they could have skewed the ‘core’ hierarchies;
    the majority of these were methodological terms; however, a small number were concepts that had been deleted from HASSET (but not yet from ELSST) in order to maintain the currency and relevance of the thesaurus.  After investigation and consultation with European colleagues, it was decided that these terms should in fact be proposed as deletions from both products.  This will require official international agreement; to expedite this, these terms have been added to the ELSST comments file as suggestions for deletion.

Technical systems have been established to monitor any differences between the two products, using SQL Server Reporting Services.  Ten reports have been set up, with alerts, to check that the hierarchies remain in alignment from now until their inclusion in a single application.

Additionally, systems have been established at the database level to identify all terms shared between the two products (known as ‘core’ terms).
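
The reporting logic is straightforward to picture.  The sketch below (illustrative Python with toy data, not the SQL Server Reporting Services reports themselves) shows how the shared ‘core’ terms and the two kinds of divergence described above reduce to simple set operations over the preferred terms of each thesaurus.

hasset_terms = {"EMPLOYMENT", "STUDENT EMPLOYMENT", "CHILDBIRTH"}   # toy data
elsst_terms = {"EMPLOYMENT", "CHILDBIRTH", "SAMPLING"}              # toy data

core_terms = hasset_terms & elsst_terms    # terms shared by both products
hasset_only = hasset_terms - elsst_terms   # HASSET-only: candidates for the ELSST comments file
elsst_only = elsst_terms - hasset_terms    # ELSST-only: possible deletions needing agreement

print("core:", sorted(core_terms))
print("HASSET only:", sorted(hasset_only))
print("ELSST only:", sorted(elsst_only))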

Version control

A version control system has been established for both HASSET and ELSST.  The following principles are being followed:

  1. All terms are date-stamped
  2. All changes to terms are recorded, no matter how small, and stored in the HASSET history file.  The details of the user who made the changes are also recorded
  3. All version information is available to the project team via a SQL Server Reporting Services dynamic interface
  4. Live versions of the thesaurus products are made available at regular, agreed intervals:
    1. ELSST is released annually, with major increments (1.00, 2.00 etc.); minor increments are not expected, but provision has been made for them in the first year
    2. SKOS-HASSET as an external product is released quarterly with minor increments, and annually with a major increment (1.00, 1.01, 1.02, 1.03, 2.00 etc.)
    3. HASSET is constantly updated and available for use for indexing internally
    4. SKOS-HASSET and ELSST annual version numbers will match

In order to test and implement this system, a previously-released version of HASSET was identified and version control applied.  This was version 1.00.  SKOS-HASSET was then released on 26 February 2013, version 1.00.  A second release (version 2.00) was then made on 25 March 2013.

From this point on, the pattern of quarterly releases began, with the next SKOS-HASSET version due in the second quarter of 2013.  This will be version 2.01.  A formal, internal procedure for managing these releases has been established.

Release of new version of ELSST

All existing ELSST translators and IP owners have been contacted and kept informed of all developments.

A new version of ELSST, including 136 ‘core’ terms agreed to have international applicability, was released on 25 March 2013.  This is version 2.00, bringing all the versions of the thesaurus products in line.  Version control will be applied at the table level.

Conclusion

All our technical objectives have now been completed and we are ready to move forward with our new and improved thesaurus products.  We are looking forward to taking this work further by entirely re-developing the management interfaces, which will give us, our international ELSST colleagues and the users of our thesauri improved and enhanced applications.

Posted in Project Management, Technical

Automatic Evaluation Recommendations Report

Lorna Balkan

1 Introduction

This report describes how we applied the automatic indexing tool KEA (see Witten et al. (2005)) to some of the UK Data Archive’s document collections, provided through the UK Data Service, and how we evaluated the results. Our aims were (1) to see whether KEA could potentially be used at the Archive in the future to aid metadata creation and (2) to develop recommendations for the future use of automatic indexing with an existing thesaurus.

Specifically, we sought to answer the following questions:

  • How well does KEA perform (compared to the gold standard) across a variety of corpora, where these corpora differ with respect to:
    1. genre
    2. topic
    3. total corpus size
    4. document length
    5. indexing type (in terms of number and level)
  • Why is KEA more successful on some document collections than others?

The experiment also revealed ways in which the thesaurus, HASSET, and Archive-internal metadata processes could be improved.

2 Evaluation tools and data

2.1 The document collection

Our initial intention was to use KEA to index the following document collections:

  1. The Nesstar bank of variables/questions
  2. Survey Question Bank (SQB) questionnaires
  3. ESDS data catalogue records
    1. abstracts (from all catalogue records)
    2. full catalogue records (from Study Number 5000 onwards: these are the most recent catalogue records, dating from 2005)
  4. Other full-text documents:
    1. case studies
    2. support guides

The two parts of Corpus (3) were conflated into a single corpus consisting of partial catalogue records. This was because some studies did not have abstracts, so basing the indexing on abstracts alone was not useful, while the full catalogue records contained too many fields that produced unhelpful index terms. It was therefore decided to use partial records only, consisting of the following fields:

  • Title
  • Alternative title
  • Series title
  • Abstract
  • Geographical units
  • Observation units

For the purposes of training KEA, all corpora needed to have manual indexes. Collections (2) and (3) had been previously indexed (but note that in Corpus (2) the number of manual index terms was restricted by the size of the PDF’s ‘document properties’ box, and in Corpus (3) manual indexing is based on the full documentation, rather than the abstracts). Collection (4) had been indexed at subject category level. During the project, a mapping was defined between the subject categories and HASSET, and the mapped HASSET terms were then assigned to this collection (note that these are generally high-level terms). Collection (1) was manually indexed especially for the project. See Barbalet (2013) for a fuller discussion of the manual indexing.

These four corpora differ in terms of:

  1. genre
  2. topic
  3. total corpus size
  4. document length
  5. indexing type (in terms of number and level)
  6. average number of manual keywords assigned per document

Note that, although corpus (4) consists of different types of document (user guides and case studies), because there were relatively few documents of each type, they were considered to be one corpus as far as training KEA was concerned.

Each document collection was divided into a training dataset (80% of the total number of documents in the collection), and a test dataset (the remaining 20%). KEA was trained on each training dataset separately, and evaluation results reported for that test dataset. Fifty documents from each test dataset were selected at random for manual evaluation (see Section 4 below).
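
A minimal sketch of that split is given below, assuming each corpus is simply a list of document identifiers; the real pipeline and file handling are not described in this post.

import random

def split_corpus(documents, train_fraction=0.8, manual_sample=50, seed=42):
    # Shuffle, take 80% for training, keep the rest for testing,
    # and draw a random subset of the test set for manual evaluation.
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    training, test = docs[:cut], docs[cut:]
    manual = random.Random(seed).sample(test, min(manual_sample, len(test)))
    return training, test, manual

training, test, manual = split_corpus(f"doc{i}" for i in range(1000))
print(len(training), len(test), len(manual))   # 800 200 50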

2.2 HASSET

HASSET has been described in a previous blog (see Balkan and El Haj (2012)). For the purposes of the evaluation experiment, geographical terms were removed, since they produced too many incorrect index terms.

2.3 Description of KEA

El Haj (2012) has described in detail how KEA works. It is a term extraction system which includes a machine learning component. The algorithm works in two main steps:

(1) candidate term identification, which identifies candidate phrases (n-grams) from the text and maps these to thesaurus terms (in our case, HASSET). Candidate terms that are synonyms of preferred terms are mapped to their preferred terms.

(2) filtering, which uses a model learned from training data labelled with thesaurus terms to identify the most significant keywords, based on certain properties or “features” which include the tf.idf measure, the position of the first occurrence of a keyword in a document, the length of keywords and node degree (the number of keywords that are semantically related to a candidate keyword).
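
To make these features concrete, here is a deliberately simplified Python sketch of how they might be computed for one candidate keyword.  It is not KEA's implementation: the document is assumed to be a list of tokens, the candidate is assumed to occur in the document, and the document-frequency table and the set of related candidates are assumed to be supplied from elsewhere.

import math

def candidate_features(candidate, document, document_frequency, n_documents, related_candidates):
    text = " ".join(document).lower()
    term = candidate.lower()
    tf = text.count(term) / max(len(document), 1)            # crude term frequency
    idf = math.log(n_documents / (1 + document_frequency.get(candidate, 0)))
    first_occurrence = text.find(term) / max(len(text), 1)   # earlier in the text = smaller value
    return {
        "tfidf": tf * idf,
        "first_occurrence": first_occurrence,
        "length": len(candidate.split()),                    # number of words in the keyword
        "node_degree": len(related_candidates),              # semantically related candidates
    }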

KEA was applied to three corpora (Nesstar questions/variables, SQB questionnaires and case studies/support guides) in stemming mode. The model was then rerun in non-stemming mode on the catalogue records.

For three corpora (SQB questionnaires, catalogue records and case studies/support guides) the system was set to produce up to 30 KEA keywords. For Nesstar, a maximum of 10 KEA keywords were generated, due to the small amount of text and the correspondingly few keywords that had been manually assigned.

3 Evaluation methodology

3.1 Overview

To judge the system’s performance, two types of evaluation were carried out:

  • automatic
  • manual

In the automatic evaluation, KEA-generated keywords were compared with the set of manually assigned keywords (the so-called ‘gold standard’). In addition, a manual evaluation was performed on a subset (50 documents) of the test set, which involved comparing the KEA keywords to the texts they have been used to index (see Section 4 below). The manual evaluation also sought to discover why KEA either failed to find concepts that had been assigned manually (so-called ‘silence’) or suggested incorrect terms (so-called ‘noise’) (see Section 7 below).

3.2 Evaluation metrics

The main evaluation metrics we used were precision, recall and F1-score, defined as follows:

· Precision = (number of relevant keywords retrieved) / (total number of keywords retrieved)

· Recall = (number of relevant keywords retrieved) / (total number of relevant keywords)

· F1-score = 2 * (Precision * Recall) / (Precision + Recall)

For the automatic evaluation, we define a KEA keyword to be ‘relevant’ if it is an exact match with a manual keyword. In manual evaluation, evaluators can judge a keyword to be relevant, even if it is not an exact match (see Section 4.3 below).

Precision, recall and F1-scores were calculated at document level, then aggregated over each document collection. We used an example-based, as opposed to a label-based, approach for our aggregation scores (see Santos et al. (2011) and Madjarov et al. (2012)). The example-based approach sums the evaluation scores for each example (or document) and divides them by the total number of examples, while the label-based approach computes the evaluation score for each label (or keyword) separately and then averages the performance over all labels. Label-based evaluation includes micro-precision, macro-precision, etc. Given the large number of keywords in our collection (potentially, the whole of HASSET), the example-based approach was preferred.

Formally, example-based precision and recall are calculated as follows (see Santos et al. (2011)):

Precision = (1/N) * Σ_i |Yi ∩ Zi| / |Zi|

Recall = (1/N) * Σ_i |Yi ∩ Zi| / |Yi|

where N is the number of examples in the evaluation set, Yi is the set of relevant keywords for example i, and Zi is the set of machine-generated keywords for example i.

We calculated the Average F1 score by summing the F1 scores of all the documents and dividing by the number of documents.
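
A short Python sketch of this example-based aggregation is given below; it was written for this post as an illustration and is not the project's evaluation code.

def example_based_scores(relevant_sets, generated_sets):
    # relevant_sets[i] is Yi (the gold-standard keywords for document i),
    # generated_sets[i] is Zi (the KEA keywords for document i).
    precisions, recalls, f1s = [], [], []
    for relevant, generated in zip(relevant_sets, generated_sets):
        overlap = len(set(relevant) & set(generated))
        p = overlap / len(generated) if generated else 0.0
        r = overlap / len(relevant) if relevant else 0.0
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
        precisions.append(p)
        recalls.append(r)
    n = len(f1s)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

p, r, f1 = example_based_scores([{"EMPLOYMENT", "WAGES"}], [{"EMPLOYMENT", "PRICES"}])
print(p, r, f1)   # 0.5 0.5 0.5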

4 Manual evaluation

4.1 Overview

In the automatic evaluation KEA-generated keywords were compared with the manually assigned keywords only. In the manual evaluation, evaluators were asked to judge how relevant the automatically-generated keywords were to the text to which they referred. Evaluators could also record, as secondary information, reasons for lack of relevance, and how close the KEA terms were to the set of manual keywords semantically – see Section 4.3 below.

4.2 Evaluators

Two evaluators were assigned to the evaluation task. Evaluator (1), who is an expert indexer, evaluated each corpus (except the user guide/case study corpus) for relevance; evaluator (2) evaluated the user guide/case study corpus for relevance and assigned relatedness scores to the corpora (see 4.3 below). Both evaluators have many years’ experience of working at the UK Data Archive, and have a thorough knowledge of HASSET.

4.3 Stages and protocol

Manual evaluation was performed in two separate stages. First, the evaluator was presented with an evaluation form (see Appendix 5) and asked to read the original text to judge the relevance (or suitability) of each KEA keyword, on a 3-point scale:

How suitable is the KEA term for Information Retrieval?

5. Extremely suitable = should definitely be keyword

2. Partially suitable = redundant[1], or somewhat too narrow or too broad

0. Unsuitable = far too broad, or completely wrong

These scores were used to derive Precision, Recall and F1 scores for the corpora (see Section 3.2 above). The definition of relevance/suitability in our experiment depends on the type of evaluation, as follows:

  • Automatic evaluation: the KEA keyword is considered relevant only if it is an exact match of a manual keyword
  • Manual evaluation:
    • ‘strictly relevant’: the KEA keyword is considered relevant if it is either an exact match of a manual keyword, or rated ’extremely suitable’ by the evaluator
    • ‘broadly relevant’: the KEA keyword is considered relevant if it is either an exact match of a manual keyword, or rated either ‘extremely suitable’ or ‘partially suitable’ by the evaluator.

In the evaluation form presented to the evaluator, the manual keywords are shown in alphabetical order, next to the KEA keywords, which appear in the order in which they are ranked by KEA. It is assumed that all exact matches are ‘extremely suitable’, since they have been assigned by professional indexers, so the form is pre-filled to reflect this.

Evaluators were also asked, for keywords that are either ‘partially suitable’ or ‘unsuitable’, to provide the following information:

Reason for lack of suitability of Kea term:

  • Too broad
  • Too narrow
  • Redundant = concept already covered by other terms (that form an associative relationship) in the KEA set
  • Completely wrong

This information is used for the informal error analysis we performed (see Section 7). (Note that it was initially assumed that all redundant terms would be partially suitable, but in the event some were judged to be unsuitable.)

A second stage of evaluation, which was carried out independently of the first stage, sought to establish how closely related the KEA keywords are to the manual keywords, according to the following scale and criteria:

To what extent is the KEA term semantically related to the Gold standard?

5. Totally related (exact match)

4. Closely related: NT, BT or RT to manual keyword

3. Somewhat related: in the same hierarchy as manual keyword

2. Remotely related: related, but not in the same hierarchy as manual keyword

1. Unrelated

The first category (‘exact match’) is computed automatically and pre-filled in the evaluation form[2]. Note that relatedness scores were not calculated for the catalogue record corpus, due to time constraints.
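
As note 2 below suggests, the middle categories could in principle also be derived automatically from the thesaurus.  The Python sketch below illustrates the idea; the look-up functions for narrower, broader, related and top terms are assumptions, standing in for queries against HASSET, and distinguishing ‘remotely related’ from ‘unrelated’ would still need a human judgement.

def relatedness(kea_term, manual_terms, narrower, broader, related, top_term_of):
    # narrower/broader/related are assumed to return sets of terms; top_term_of returns
    # the top term of the hierarchy a term belongs to.
    if kea_term in manual_terms:
        return 5                                  # totally related (exact match)
    for manual in manual_terms:
        if kea_term in narrower(manual) | broader(manual) | related(manual):
            return 4                              # closely related: NT, BT or RT
    if any(top_term_of(kea_term) == top_term_of(manual) for manual in manual_terms):
        return 3                                  # somewhat related: same hierarchy
    return None                                   # 2 vs 1 left to the human evaluator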

5 Comparison with other approaches

The standard evaluation paradigm for automatic indexing is to compare machine-generated indexes automatically with a gold standard. The problems with this approach are well documented, since the choice of index terms is often very subjective[3]. Approaches to overcoming this problem include pooling the index terms suggested by a number of different indexers and taking either the pooled set (union) or the intersection of the indexers’ terms as the gold standard (see for example Pouliquen et al. (2003)). Another common approach is to accept as relevant not just exact matches, but also those terms that are semantically related to the manual keywords. Semantic relatedness to the manual keywords can be computed automatically based on various criteria, for example closeness of the terms in the thesaurus hierarchy (see for example Medelyan and Witten (2005)) and/or morphological similarity (see for example Zesch and Gurevych (2009)).

A problem with these approaches is that they assume that the closer the match to the gold standard semantically, the more relevant a term will be as a keyword. However, we show (in Section 8.4 below) that this assumption is not always true, as in some of our corpora, some relevant keywords were totally unrelated to those in the gold standard for reasons discussed in Section 2.1 above.

The alternative to automatic evaluation is to perform a manual evaluation. However, this is both costly and time-consuming. In a manual evaluation, evaluators are most often asked to compare the set of machine-generated keywords directly with the source text (see for example Jacquemin et al. (2002), Névéol et al. (2004) and Medline (2002)). While this is a useful approach, it does not capture any relationship between what the machine assigns and what the human assigns as keywords, which may be useful to know.

Our approach aims to capture both relatedness to the gold standard and relevance to the source text.

Many manual evaluations also involve a qualitative analysis of the automatic indexing terms (see for example Abdul and Khoo (1989), Eckert et al. (2008), Clements and Medline (2002)). We also undertook an error analysis of our results (see Section 7).

6 Limitations of the approach

Our experiment suffered from a number of limitations, due mainly to time constraints:

  • Gold standard:
    • Taking the set of manually assigned keywords as the gold standard is particularly problematic for the following datasets:
      • Catalogue records: manual keywords and KEA keywords have not been used to index the same thing – the manual indexers have indexed the whole documentation, while KEA has been used to index partial catalogue records only, including the abstract. The abstract is, however, often taken from the documentation and is a summary of it.
      • SQB questionnaires: due to space restrictions (see Section 2.1 above) existing manual keywords have been limited in number.
      • Case studies/user guides: a small subset of the thesaurus (mapped from subject categories) has been used for manual indexing, while KEA used the whole thesaurus.
  • The number of keywords in the gold standard does not always match the number of keywords generated by KEA. Where there are more manually-generated keywords than KEA keywords, 100% recall will not be possible. Several researchers have suggested recalculating evaluation scores for different numbers of automatic keywords (e.g. 5, 10, 20) or using a ‘dynamic rank’, i.e. where the number of manually assigned keywords is the same as the number of automatically assigned keywords (see for example Steinberger et al. (2012)). As Steinberger et al. point out, however, this is not helpful for new texts which have no manual terms as a reference.
  • Number of evaluators: Due to time and resource restrictions, only one evaluator was used to evaluate each document – ideally, a number of evaluators would evaluate each document and the results would be averaged across them.
  • Indexer-centred evaluation: This evaluation is very much indexer-centred, since it is designed to investigate whether or not KEA could be a useful tool for indexers. To get a proper estimate of its value to users, a user-centred evaluation would need to be conducted.
  • Evaluation form: A blind evaluation, where the evaluator is unaware whether the keyword has been generated manually or automatically, would make the evaluation less subjective.

7 Error analysis

We distinguish between precision errors, where KEA returns incorrect or partially relevant terms, and recall errors, where KEA fails to find relevant terms.

7.1 Precision errors

Reasons for poor precision include cases where the KEA keyword was:

  1. too broad
    E.g. EMPLOYMENT when the text is about “STUDENT EMPLOYMENT”
    This includes cases where the keyword is chiefly used as a placeholder in HASSET, with a scope note advising the use of a more specific term instead:
    E.g. RESOURCES has the following scope note: “AVAILABLE MEANS OR ASSETS, INCLUDING SOURCES OF ASSISTANCE, SUPPLY, OR SUPPORT (NOTE: USE A MORE SPECIFIC TERM IF POSSIBLE). (ERIC)”
  2. too narrow
    This generally only occurred in the case study/support guide corpus.
  3. redundant
    E.g. CHILDBIRTH when PREMATURE BIRTHS is also found
  4. somewhat or completely wrong

Category (4) can be for a variety of reasons, including:

  1. KEA identifies the correct term, but it has a different meaning in HASSET to that in the text because:
    • they are homonyms:
      E.g. KEA retrieves the keyword WINDOWS (meaning features of a house) because it matches “Windows” (Computer software) in the text
    • the term is used idiomatically in the text, but has a literal meaning in HASSET[4]
      E.g. PRICES is used in its literal sense in HASSET, but idiomatically in the text – “Do higher wages come at a price?”
    • the HASSET term is used in a restricted sense (often indicated in a scope note) which is different to the general language usage found in the text
      E.g. the HASSET term WORKPLACE is used to refer to the location of work only
  2. KEA identifies the wrong term because:
    • it fails to distinguish between terms containing qualifiers
      E.g. KEA retrieves CHILDBIRTH (UF: “LABOUR (PREGNANCY)”) instead of “LABOUR (WORK)” because it matches “labour” (meaning ‘work’) in the text
    • it is unable to parse a compound term correctly:
      E.g. KEA retrieves DEVELOPMENT POLICY because “collections development policy” was found in the text (i.e. it matches an incorrect sub-part of the compound term)
    • the errors are due to stemming:
      E.g. NATIONALISM matches “nation” in the text
      E.g. TRUST matches “trusts” in the text
      E.g. ORDINATION matches “co-ordination” in the text
  3. Too many closely related terms in HASSET make it difficult for KEA to discriminate between them. Examples include:
    • the many variants of TRAINING in the thesaurus – FURTHER TRAINING, OCCUPATIONAL TRAINING, EMPLOYER-SPONSORED TRAINING
    • OFFSPRING and CHILDREN: OFFSPRING has the scope note: USE SPECIFICALLY FOR CHILDREN, REGARDLESS OF AGE. NOT TO BE USED AS AN AGE IDENTIFIER. MAY BE USED FOR ADULT CHILDREN, OR FOR QUESTIONS WHERE AGE OF CHILD IS NOT SPECIFIED. TERM CREATED JUNE 2005. PREVIOUSLY THE TERM “CHILDREN” MAY HAVE BEEN USED.
    • CRIMES is a UF of OFFENCES, not CRIME, which is a separate term
  4. The term belongs to a part of the document that should be ignored for indexing purposes (e.g. author names in case studies). Note that KEA gives greater weight to terms the closer they occur to the beginning of the document. This causes a problem in some corpora, e.g. user guides and case studies, which often begin with background text to set the topic in context.

Possible solutions for precision errors:

  • Add terms to stopwords:
    E.g. INFORMATION, DATA, RESEARCH, ANALYSIS, EVALUATION, TESTS
  • Add new UFs to preferred terms
  • Reduce stemming[5]
  • Remove irrelevant parts of the text:
    This would be relatively easy to do for some of our corpora, e.g. case studies
  • Apply word sense disambiguation (WSD) techniques to help identify the correct use of a homonym in HASSET:
    There is some form of context sensitivity in KEA, since the filtering stage (see Section 2.3 above) is based partly on the node degree (the number of keywords that are semantically related to a candidate keyword). Other ways of introducing context sensitivity have been discussed in the literature – see for example Pouliquen et al. (2003) who use the notion of associate terms to help select keywords.

7.2 Recall errors

Reasons for poor recall include:

  1. The concept is in HASSET, but is not recognised by KEA
  2. The concept is not in HASSET so has to be represented by a combination of other terms. Examples include:
    • CHILDHOOD POVERTY is represented by SOCIALLY DISADVANTAGED CHILDREN
    • some methodological terms

Sources of recall errors include:

  1. The HASSET term has a slightly different form from that found in the text: e.g. PRICE POLICY is not found although “pricing policy” is in the text
  2. The HASSET term is hyphenated, while the term in the text is not: e.g. BREAST-FEEDING is not found although “breastfeeding” is in the text
  3. The HASSET term is too abstract to be found verbatim in the text: e.g. LIFE SATISFACTION is not found, although “enjoying life” is in the text

In many cases, the source of recall errors is not obvious, and needs further investigation.

Possible solutions for recall errors include:

  • Add new UFs to current HASSET terms
  • Add new preferred terms
  • Add stemming

8 Discussion of the results

8.1 General remarks

The following subsections summarise the results of the evaluation as shown in the appendices below.  The results are all based on samples of 50 documents from each test set.  It should be borne in mind that the catalogue record results were produced using a different KEA model and, unlike the other corpora, KEA indexing was based on a subset of the text that was used for the manual indexing, as explained in Section 2.1 above.  For this reason, the results are not directly comparable with those of the other corpora.

8.2 Precision, recall and F1

Performance was measured in terms of Precision, Recall and F1. Three different degrees of each were recorded – Auto, Strict and Broad, as shown in Appendix 1:

  • Auto: auto scores are based on exact matches between KEA and manual keywords. They can be computed automatically
  • ‘Strict’: include exact matches and additional KEA keywords rated ‘extremely suitable’ by the evaluator
  • ‘Broad’: include exact matches, and additional KEA keywords rated either ‘extremely suitable’ or ‘partially suitable’ by the evaluator.

The best overall performance was seen in the SQB corpus, with a broad F1 score of 0.43.  Close behind were the Nesstar and case studies/support guides corpora, with c. 0.35 each.  Catalogue records had a low F1 score of 0.21.  This was to be expected, given that KEA had relatively little text to index from, compared to the manual indexers.  This, together with the fact that KEA was applied in non-stemming mode, led to a poor recall score.  However, the precision rate for catalogue records was 0.42, which means that the keywords KEA did find were very often relevant.

The highest recall score was found in the case study/support guide corpus (0.73). This suggests that KEA could be usefully employed to suggest new relevant terms for this type of corpus.

As expected there was relatively little overlap between KEA keywords and manual keywords (on average KEA found 18.60 keywords per document across the four corpora, of which only 2.33 were exact matches with the manual keywords) – see Appendix 4.2. However, a high percentage of KEA keywords were considered relevant/suitable even if they were not exact matches – 33% for the SQB corpus, with an average of 25% across all four corpora. This suggests that KEA could be a very useful tool for indexers.

It is not clear, from our initial experiments, to what extent the system’s performance is dependent on the number and size of training documents. Our four corpora differ considerably in this respect – the Nesstar collection contains the largest number of documents, but these are very small in size, containing a single question, while the SQB corpus contains the largest documents. The training datasets are 80% of the total number of documents in each corpus. If we assume that the training datasets are also 80% of the size of the entire corpora (which may not hold, since documents are not of a uniform length), then we can conclude that the highest F1 score is reported for the corpus with the largest amount of training data (i.e. the SQB corpus, as shown in Appendix 1) rather than that with the largest number of training documents (i.e. the Nesstar corpus). Performance is clearly related not just to the number and size of the training documents but to their associated keywords. The average number of manual keywords assigned per document varies considerably – in the sample shown in Appendix 4.2, it varies from 1.63 for Nesstar questions/variables to 62.86 for catalogue records – but it is important to bear in mind differences in the completeness and level of the keywords, as well as their number.

Further investigation is also required to establish the influence of genre on performance. For example, the case studies and support guide collections were considered to be a single corpus for the sake of the experiment, but may exhibit different behaviour if processed and evaluated separately. The support guides, for instance, use more formal language than the case studies (which are aimed at a popular audience), so they are less likely to cause precision errors arising from the idiomatic use of language.

Within text types, topic clearly plays a role. For example, support guides on methodology and cataloguing procedures fared less well than the guides on substantive topics like health and employment, since HASSET has few terms to cover the first topics.[6]

8.3 Reasons for partial or lack of suitability of keywords

For three out of the four corpora (Nesstar, SQB and catalogue records), the terms deemed partially suitable, rather than extremely suitable, were mostly too broad (see Appendix 2).  Only in the case of the case studies/support guides corpus was a sizeable proportion of partially suitable keywords (c. 50%) deemed to be too narrow.

Across all four corpora unsuitable terms were usually judged to be completely wrong, rather than too broad.[7]

8.4 Relatedness scores

Relatedness measures how close the KEA keywords are to the manual keywords.

Across the three corpora that we rated for relatedness (Nesstar, SQB and case studies/support guides), there is no consistent relationship between the suitability of KEA keywords and their relatedness to the manual keywords. In the case of Nesstar, 100% of keywords that were not exact matches of the manual keywords but were deemed ‘extremely suitable’ were either closely or somewhat related to the manual keywords, while for the SQB and case studies/support guides corpora over 50% were either remotely related or unrelated to the manual keywords (see Appendix 3).

This suggests that, in the absence of manual evaluation, relatedness of KEA keywords to manual keywords based on their position in the thesaurus could not be used as a good indicator of whether or not they are extremely relevant. A similar situation is true for partially suitable keywords and their relationship with manual keywords.

There could be several reasons for this. First, the manual keywords may exclude important topics, due to time or space limitations (this is particularly true for the SQB and the case study/support guide corpora). Alternatively, or additionally, the KEA keywords may well be closely related to the manual keywords, but this is not reflected in the thesaurus structure, upon which the relatedness definitions are based. An example is VOTING, which is not related to VOTING BEHAVIOUR or VOTING INTENTION in the thesaurus. Examination of cases such as these will be useful when we come to revise the thesaurus.

Conversely, some KEA keywords rated as ‘somewhat related’ because they share the same hierarchy as a manual keyword, are in fact semantically far distant. For example, SUGAR is in the same hierarchy as MEDICAL DRUGS and is thus judged to be somewhat related. This is because their shared hierarchy PRODUCTS is very large.

9 Conclusions and recommendations

Our experiments with KEA proved a useful introduction to automatic indexing at the UK Data Archive. The results of our initial investigations are encouraging, and lead us to believe that KEA could provide a useful tool for our indexers. We would, however, have to conduct a user-oriented evaluation to see how the system could be incorporated into our workflows.

It would also be useful to conduct further experiments with the system to see how we could improve the model and run the system more efficiently (see some of the suggestions we make in Section 7).

Work on KEA has also provided us with useful insights into how we could improve our processing procedures (which we reviewed when we prepared the texts and metadata prior to running KEA) and the thesaurus – for example, our preliminary error analysis highlighted the need for more synonyms and revealed cases where there are too many similar terms.

We make the following recommendations:

  1. KEA is a useful tool for indexers of full text social science materials;
  2. however, KEA would work best as a suggester of new terms, with moderation from a human indexer;
  3. KEA could also be used as a quality assurance tool, to ensure that terms are not overlooked – some terms it suggested that were highly relevant had not been included in the gold standard, manual indexing;
  4. more work is needed to investigate KEA further and to see how it could be incorporated technically, and in terms of process, into ingest systems.

Notes

1. See Barbalet (2013) for a more detailed discussion of redundancy in indexing.

2. Note that the calculation of categories ‘closely related’ and ‘somewhat related’ could also be automated, and this may be implemented at a future date.

3. Indexing of our corpora follows quality control procedures which help address the problem of subjectivity – see Barbalet (2013).

4. Cases like these were only found in the case studies/user guides corpus.

5. Note, however, that while reducing stemming will improve precision, it will have a negative impact on recall, and there is always a trade-off between precision and recall.

6. It turned out also that the Nesstar training dataset contained many duplicated questions, unlike the test sample, which had very few, and this may have had an effect on the results.

7. Note: there were also some cases of terms being unsuitable because they were too narrow, or redundant – neither of these possibilities was envisaged when the experiment was set up, so statistics for these are not reported separately.  See Barbalet (2013) for possible reasons for these error types.

References

Abdul, H. and Khoo, C. (1989) ‘Automatic indexing for medical literature using phrase matching – an exploratory study’, In Health Information: New Directions: Proceedings of the Joint Conference of the Health Libraries Sections of the Australian Library and Information Association and New Zealand Library Association, Auckland, New Zealand. 12-16 November 1989, pp. 164-172.

Balkan, L. and El Haj, M. (2012) ‘SKOS-HASSET evaluation plan’, blog.
http://hassetukda.wordpress.com/2012/08/16/skos-hasset-evaluation-plan/

Barbalet, S. (2013) Gold standard indexes for SKOS-HASSET evaluation: a review, blog. http://hassetukda.wordpress.com/

Clements, J. ‘An Evaluation of Automatically Assigned Subject Metadata using AgroTagger and HIVE’ http://aims.fao.org/sites/default/files/files/Clements_FAO_Metadata_Assignment.pdf

Eckert, K., Stuckenschmidt, H. and Pfeffer, M. (2008) ‘Interactive thesaurus assessment for automatic document annotation’, in Proceedings of the Fourth International Conference on Knowledge Capture (K-CAP 2007), Whistler, Canada. http://publications.wim.uni-mannheim.de/informatik/lski/Eckert07Thesaurus.pdf

El Haj, M. (2012) UKDA Keyword Indexing with a SKOS Version of HASSET Thesaurus, blog. http://hassetukda.wordpress.com/2012/09/24/ukda-keyword-indexing-with-a-skos-version-of-hasset-thesaurus/

Jacquemin, C., Daille, B., Royaute, J. and Polanco, X (2002): ‘In vitro evaluation of a program for machine-aided indexing’, Information Processing and Management, 38(6): pp. 765-792, http://perso.limsi.fr/jacquemi/FTP/IPM-1354-jacquemin-et-al.pdf

Madjarov, G., Kocev, D., Gjorgjevikj, D. and Džeroski, S. (2012) ‘An extensive experimental comparison of methods for multi-label learning’, Pattern Recognition, doi:10.1016/j.patcog.2012.03.004, http://kt.ijs.si/DragiKocev/wikipage/lib/exe/fetch.php?media=2012pr_ml_comparison.pdf

Manning, C., Raghavan, P., and Schutze, H. Introduction to Information Retrieval (2008), Cambridge University Press, http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html

Medelyan, O. and Witten, I. (2005) ‘Thesaurus-based index term extraction for agricultural documents’, in Proceedings of the 6th Agricultural Ontology Service (AOS) workshop at EFITA/WCCA 2005, Vila Real, Portugal. http://www.medelyan.com/files/efita05_index_term_extraction_agriculture.pdf

Medline (2002): A Medline indexing experiment using terms suggested by MTI: A report. http://ii.nlm.nih.gov/resources/ResultsEvaluationReport.pdf

Névéol, A., Soualmia, L.F., Douyère, M., Rogozan, A., Thirion, B. and Darmoni, S.J. (2004) ‘Using CISMeF MeSH “Encapsulated” terminology and a categorization algorithm for health resources’, International Journal of Medical Informatics, 73, pp. 57-54, Elsevier. http://mini.ncbi.nih.gov/CBBresearch/Fellows/Neveol/NeveolIJMI04.pdf

Pouliquen, B., Steinberger, R. and Ignat, C. (2003) ‘Automatic annotation of multilingual text collections with a conceptual thesaurus’, Proceedings of the Workshop Ontologies and Information Extraction at EUROLAN 2003, http://arxiv.org/ftp/cs/papers/0609/0609059.pdf

Santos, A., Canuto, A. and Feitosa Neto, A. (2011) ‘A Comparative Analysis of Classification Methods to Multi-label Tasks in Different Application Domains’, Computer Information Systems and Industrial Management Applications, 3, pp.218-227, http://www.mirlabs.org/ijcisim/regular_papers_2011/Paper26.pdf

Spasic, I., Schober, D., Sansone, S-A., Rebholz-Schumann, D., Kell, D. and Paton, N. (2008) ‘Facilitating the development of controlled vocabularies for metabolomics technologies with text mining’, BMC Bioinformatics 2008, 9(Suppl 5):S5. http://www.biomedcentral.com/1471-2105/9/S5/S5

Steinberger, R., Ebrahim M., and Turchi, M. (2012) ‘JRC EuroVoc Indexer JEX – A freely available multi-label categorisation tool’, LREC conference proceedings 2012. http://www.lrec-conf.org/proceedings/lrec2012/pdf/875_Paper.pdf

Witten, I.H., Paynter, G.W., Frank, E, Gutwin, C. and Nevill-Manning, C.G. (2005) ‘Kea: Practical automatic keyphrase extraction’, in Y.-L. Theng and S. Foo (eds.) Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, Information Science Publishing, London, pp. 129–152.

Zesch, T. and Gurevych, I. (2009) ‘Approximate Matching for Evaluating Keyphrase Extraction’, In Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing September 2009, pp. 484-489, http://aclweb.org/anthology-new/R/R09/R09-1086.pdf

Appendix 1 Average Precision, Recall and F1 scores

Appendix 2 Reasons for partial or lack of suitability of keywords

2.1 Reasons for partial suitability of keywords

Average percentage per document of partially suitable keywords that are:

Corpus name                     Too broad   Too narrow   Redundant
Nesstar                         85.00%      5.00%        10.00%
SQB                             64.85%      19.61%       12.35%
Cat. records                    84.38%      0.00%        12.50%
Case studies/support guides     50.69%      49.31%       0.00%

2.2 Reasons for unsuitability of keywords

Average percentage per document of unsuitable keywords that are:

Corpus name                     Too broad   Completely wrong
Nesstar                         23.66%      74.19%
SQB                             1.35%       96.38%
Cat. records                    1.27%       96.23%
Case studies/support guides     16.22%      73.71%

Appendix 3 Average relatedness scores

3.1 Average relatedness to manual keywords of extremely suitable keywords that are not exact matches

Corpus name                     Closely related   Somewhat related   Remotely related   Unrelated
Nesstar                         50.00%            50.00%             0.00%              0.00%
SQB                             31.33%            17.71%             11.35%             41.63%
Cat. records                    (not calculated)
Case studies/support guides     25.19%            14.29%             15.85%             44.63%

3.2 Average relatedness to manual keywords of partially suitable keywords

Corpus name                     Closely related   Somewhat related   Remotely related   Unrelated
Nesstar                         27.50%            42.50%             12.50%             17.50%
SQB                             29.79%            21.19%             9.42%              39.60%
Cat. records                    (not calculated)
Case studies/support guides     32.37%            7.42%              16.01%             44.20%

Appendix 4 Other statistics

4.1 Average relatedness and suitability

Corpus name                     Average relatedness   Average suitability
Nesstar                         2.12                  1.26
SQB                             2.29                  2.01
Cat. records                    (not calculated)      1.97
Case studies/support guides     1.67                  1.04

4.2 Average number of keywords (manual and KEA) per document

Appendix 5 Example of evaluation form

Posted in Evaluation, Text mining

SKOS HASSET Development Process

John Payne

One of the deliverables of the SKOS HASSET project was to provide the ability to present the HASSET thesaurus as SKOS linked data.  This required elements of work including database cleansing, application creation and software configuration.  As described in an earlier blog entry, we chose to deliver SKOS by utilising the open source project PUBBY and by converting our SQL Server based HASSET thesaurus into RDF and storing this in a BrightstarDB triple store.  For more information on SKOS, Pubby and Triple stores, refer to this post.
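
For readers curious what the relational-to-RDF step looks like in outline, the Python sketch below builds SKOS triples from a couple of illustrative rows and serialises them as Turtle using rdflib.  It is a simplified stand-in for the application we actually wrote: the table layout, base URI and labels are invented for the example.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

HASSET = Namespace("http://example.org/hasset/")   # hypothetical base URI

rows = [  # stand-in for rows read from the SQL Server thesaurus tables
    {"id": "C123", "pref_label": "EMPLOYMENT", "broader": None},
    {"id": "C124", "pref_label": "STUDENT EMPLOYMENT", "broader": "C123"},
]

g = Graph()
g.bind("skos", SKOS)
for row in rows:
    concept = HASSET[row["id"]]
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(row["pref_label"], lang="en")))
    if row["broader"]:
        g.add((concept, SKOS.broader, HASSET[row["broader"]]))

print(g.serialize(format="turtle"))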

At the Data Archive, the Application Development and Maintenance team adopt Agile methodologies wherever possible to see projects through to deployment.  We employed these Agile techniques in the SKOS HASSET project in order to focus and drive development through to delivery of the completed product.

At the Archive, we have used JIRA for three years to manage both issue tracking and development tasks across our complete range of projects – both development and maintenance.  We have a plugin called GreenHopper installed within JIRA, which delivers an Agile presentation layer on top of the issues/tasks and provides configurable Scrum and Kanban views and functionality over any project or combination of projects you choose.  The documentation even suggests you combine the two and try Scrumban!

The combination of JIRA and Kanban was used for these sprints to track issues and progress.  JIRA issues contained the tasks, including current status, comments, time logging etc., and we used the Kanban JIRA plugin to give a visual representation, every morning, of the current state of play and the progression of tasks from ‘not started’ through ‘in progress’ to ‘complete’, rather than using physical post-it notes and a whiteboard.

We decided to use two sprints during the development of SKOS-HASSET, with the second following several weeks after the first.  This is our preferred strategy.  Sprint one had specific goals: laying the groundwork in terms of data quality and producing, internally, a valid SKOS file.  Sprint two tied everything together and addressed any issues that came to light during sprint one.

The process we adopted was:

Sprint Preparation

During sprint preparation, the three developers involved met and picked through the complete list of issues/requirements to familiarise everyone with the task at hand.  These tasks were then created within JIRA and each was assigned to an individual and prioritised.

Sprint

The initial sprint lasted for five days and primarily involved validating and cleaning the data and creating an application to create valid triples from our relational database version of the thesaurus. Every morning of the sprint involved a short ‘stand-up’ meeting where progress, problems and proposed work for the current day were briefly described by each developer.  This was backed up visually by the Kanban view provided by GreenHopper.  All application code created was stored in SVN source control and built within Jenkins, our continuous integration server, in order to satisfy our coding quality standards.

Post sprint review

In the week following the sprint, the developers met to reflect on what we had achieved and what issues we encountered.  This was also a good opportunity to make sure that both the addressed and remaining issues had been documented and commented upon in readiness for the second sprint.

Sprint 2

Sprint 2 was a smaller, two-day sprint and was the final push to move from our development environment to a production environment ready for external consumption.  The requirements for sprint 2 were not data-related but focused instead on implementing Pubby on a newly set-up production server and ensuring that all underlying data was now being created and supplied from the production environment.

Sprint preparation

The developers once again met to discuss the remaining list of issues/requirements.  These were then reassigned in certain instances and reprioritised.  During this second preparation phase, we also tried to resolve any external dependencies that would otherwise hamper the forthcoming sprint such as setting up of domain names and preparing for firewall changes etc.

Sprint

The second sprint was better described as a dash with it being so short!  Most of this sprint involved configuring a new production web server to host Pubby, correctly installing our Triple Store onto its live server and deploying application code and tables from development to production.

Post sprint

It would be lovely to say that after sprint 2 all our issues were closed, but this is not quite true.  We still have a couple of small internal loose ends, but these either do not directly affect the SKOS HASSET product or were moved out of the scope of this development cycle. One advantage of using JIRA to manage tasks is that these remaining issues are formally documented and must be commented on, resolved and closed by the project manager before the project is completed.

As I started out by saying, in terms of scale the SKOS-HASSET development was only small, but our decision to adopt the ‘sprinting mind-set’ was a sensible choice.  The Agile techniques of sprinting and holding short, stand-up morning meetings are insightful: they not only deliver information, they act as glue between the team members and provide the focus and impetus to keep momentum going and deliver results in a short timeframe.

Posted in Uncategorized

SKOS-HASSET Webinar: 28 March 2013, 10:00 – 11:00

The SKOS-HASSET team will be presenting the results of its work on 28 March 2013, at 10:00 GMT via a Webinar.  Please join us to hear more about the project.  Space is limited, so do sign up with GotoWebinar to reserve your place.  After registering you will receive a confirmation email containing information about joining the Webinar.

The webinar will describe the work undertaken in the SKOS-HASSET project, a 10-month, JISC-funded project in the UK Data Archive, University of Essex, which ran from June 2012 until March 2013.  Its aims were to:

1. Apply SKOS to HASSET
2. Improve its online presence
3. Test its automated indexing capacity

Simple Knowledge Organization System (SKOS) is a language designed to represent thesauri and other classification resources.  It encodes these products in a standardised way, using RDF, to make their structures comparable and to facilitate interaction.

The webinar will give an overview of the aims and objectives of the project, the technologies used and the results of the automated indexing work.  For this last piece of work, SKOS-HASSET was taken as the terminology source for an automatic indexing tool (KEA) and applied to question text, variables, abstracts and publications from the Archive’s collection.  The results were compared to the gold standard of human indexing and will be presented at the webinar.  SKOS-HASSET itself will also be demonstrated.

Webinar invitation details:

Title: SKOS-HASSET JISC-funded project: results and discussion
Date:    Thursday, March 28, 2013
Time:    10:00 AM – 11:00 AM GMT
Sign up URL:    https://www3.gotomeeting.com/register/897319998

System Requirements
PC-based attendees
Required: Windows® 7, Vista, XP or 2003 Server

Mac®-based attendees
Required: Mac OS® X 10.5 or newer

Mobile attendees
Required: iPhone®, iPad®, Android™ phone or Android tablet

We hope to see you there!

The SKOS-HASSET Team

Posted in Uncategorized

Gold Standard Indexes for SKOS-HASSET Evaluation: A Review

Suzanne Barbalet

1. Introduction

The gold standard indexes we used for SKOS-HASSET training and evaluation were a combination of our in-house, quality-controlled indexes and a specially prepared index used to train KEA to index variables. The latter task was performed as exhaustively as possible to enhance training, but could not ignore efficiency constraints. All indexes conform to ISO standards.

In–house indexing takes account of:

  • the perceived needs of the users
  • policy issues incorporating plans for the future development of the collection

Indexing data and indexing documents are slightly different processes:

  • Concepts within data are measurements and their definitions may vary
  • Concepts within data-related documentation have a general language definition

Indexing the former type of concept is a two stage process of translating its operationalized definition into a general language definition and from that to a thesaurus definition. The latter is a simpler process requiring some form of mapping between a general language definition and the thesaurus definition. This blog explains these two processes in greater detail and outlines how the indexes were prepared with reference to the functions each corpus was designed to perform.

2. Information retrieval requirements

Funded by ESRC, the JISC and the University of Essex, the UK Data Archive is committed to supporting secondary analysis; that is the re-use of quantitative, qualitative or historical data. Data analysis requires knowledge of and access to specialist statistical software such as SPSS or Stata; thus, the Archive’s users are a special clientele who rely on the Archive both to supply data in an appropriate form and to support their use.

Since Archive users are primarily data analysts, an important access requirement is variable information. Some users will be specialists in a particular data collection, others will have broader interests and require subject rather than variable access. Data analysts will wish to retrieve particular variables of interest or to locate studies of interest; survey managers, teachers of survey methods, social and economic researchers and post graduate students wishing to conduct pre-analysis will search for relevant survey questions and other information in the documentation.

To meet these needs the Archive has enriched its resources through the provision of the Nesstar service, the Survey Question Bank project and by producing Case Studies and Support Guides. These resources can be represented thus:

Fig 1. SKOS-HASSET Test Corpora

3. The Corpora

We used four corpora for testing KEA for the SKOS-HASSET project. They comprised ESDS Data Catalogue Records with existing study-level keywords, plus three supplementary corpora:

  • Survey Question Bank PDFs with existing keywords, providing access to the documentation the depositor supplies with the dataset

  • Case Studies with existing subject indexes, providing examples of how Archive studies have been used in teaching and research, and Support Guides, providing an overview of the data collections and internal procedures

  • Nesstar Bank of Variables/Questions, including only research datasets, with a tailor-made index

These supplementary resources have evolved to provide public access to ‘data-related’ documentation and to allow users to view frequencies and make cross-tabulations using the Nesstar service.

Procedures for cataloguing and indexing have evolved alongside the development of the Archive’s thesaurus, HASSET. Initially HASSET was based on the UNESCO Thesaurus (1977); it is now established as a general social science thesaurus, and indexing is directed towards providing the best access to variables within the data. Limits are not imposed on the number of index terms permitted per data collection, in order to maintain a high standard of information retrieval. Cataloguers select as many terms as they find necessary, which may vary between 10 and 450 per data collection. Data indexing tends to incorporate both broad and narrow terms to cover the variation in concept levels that variables may include.

Data are a complex object in information retrieval terms. When concepts are operationalized, the indicators specified will rarely be found in the same thesaurus hierarchy, nor will they necessarily have a lexical relationship. For example, the Office for National Statistics webpage Different Aspects of Ethnicity recommends that the variables ‘Country of birth’, ‘Nationality’, ‘Language spoken at home’, ‘Skin colour’, ‘National/geographical origin’ and ‘Religion’ should together indicate the concept of ‘Ethnicity’. At the same time, indicators such as ‘Religion’ may also contribute to the operationalization of other concepts. Providing this information, together with information on data collection methods, data preparation and results or findings, is best practice for data depositors and is key to enabling the secondary user to make informed use of the data. It is this information that is incorporated in the catalogue record abstract and subject categories (from the DDI element <topcClas>) which were used for KEA.

All four corpora therefore reference data collections in the data catalogue, indexed with HASSET terms. Since each corpus was arranged to meet specific user needs, different indexing strategies were necessary for each corpus, and so KEA was trained on each corpus separately.

4. Manual Indexing Practices

Two principles of indexing are worth discussing in the context of providing a gold standard for Kea. They are:

  • Specificity
  • Associative terms

4.1 The Principle of Specificity

The central principle of indexing is to use the most specific term that entirely covers the topic (Lancaster, 1998: 28; de Keyser, 2012: 11; Broughton, 2004: 70). It may be that the controlled vocabulary or thesaurus does not include a term at the level of specificity required by a particular resource, in which case either a higher-level term is used or a more specific term is added to the thesaurus (Lancaster, 1998: 30).
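
Purely by way of illustration, the sketch below applies this principle over a SKOS vocabulary: among the concepts whose preferred labels occur in a piece of text, it keeps only those with no matching narrower concept. It assumes a local Turtle copy of the thesaurus (the file name is hypothetical) and uses the rdflib library.

    from rdflib import Graph
    from rdflib.namespace import SKOS

    def most_specific_matches(graph, text):
        """Concepts whose prefLabel occurs in the text and none of whose
        narrower concepts also occur, i.e. the most specific applicable terms."""
        text_lower = text.lower()

        def occurs(concept):
            # Simplification: checks only the first prefLabel found for the concept.
            label = graph.value(concept, SKOS.prefLabel)
            return label is not None and str(label).lower() in text_lower

        matched = {c for c in graph.subjects(SKOS.prefLabel, None) if occurs(c)}
        return {c for c in matched
                if not any(n in matched for n in graph.objects(c, SKOS.narrower))}

    g = Graph()
    g.parse("skos-hasset.ttl", format="turtle")  # hypothetical local copy
    for concept in most_specific_matches(g, "attitudes to unemployment and job search"):
        print(concept, g.value(concept, SKOS.prefLabel))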

The topics of the Case Studies and Support Guides are extremely broad, while the Nesstar Bank of Variables/Questions index requires very specific index terms; the level of specificity of indexing required for each corpus can therefore be represented as below:

Fig 2. Level of Specificity of Indexing required by the content of the SKOS-HASSET Corpora

Case Studies and Support Guide Corpus
Case Studies and Support Guides have been collected together under a broad topic classification scheme and tagged with UK Data Archive Subject Categories. These were mapped to HASSET terms for the SKOS-HASSET project and became our gold standard. The case study "Unemployment and Psychological Well-Being", for example, is classified under the topics LABOUR AND EMPLOYMENT + ECONOMICS, both of which are HASSET top terms for very large hierarchies.

Economic and Social Data Service (ESDS)/UK Data Service Catalogue Corpus
The main resource for information retrieval managed by the UK Data Archive is the ESDS (and soon to be UK Data Service) catalogue of studies. This is a union catalogue shared by members of the Service and includes records of a variety of data collections: data for quantitative analysis, including census data; data suitable for qualitative analysis; and historical research resources.

The aim of indexing a data collection is both to enable a particular study to be retrieved and to provide access to the variables available for analysis within that study.

While a cardinal rule of indexing is not to employ ‘multiple indexing’ (de Keyser, 2012: 12), that is, not to introduce redundant terms (Lancaster, 1998: 280), a dataset is not one discrete object of information in the way a document is. Within a data collection the objects to be retrieved are variables or questions as well as documents or support guides, and all need to be accessed together with their associated documents. It is therefore often necessary to use ‘multiple indexing’, or to introduce what may appear to be ‘redundant’ terms, which explains why, in the catalogue record, a high number of top terms together with specific terms, sometimes from the same hierarchy, can be necessary inclusions.

An example in this corpus, which by chance was also included in the Survey Question Bank Corpus, is Study 6843: British Gambling Prevalence Survey, 2010. Although it had been indexed with many top terms, the study had a total of 39 lower-level terms among its 53 HASSET keywords in the ESDS Data Catalogue Records.

Survey Question Bank (SQB) PDF Corpus
The third corpus is the Survey Question Bank PDF collection of documents. In 2007 the Archive inherited from the University of Surrey an organized collection of survey documentation and enriched associated materials in PDF format (Qb). Survey documentation is processed at the point of ingest and access is provided on the catalogue metadata page; however, the format in which these materials arrive at the Archive is in no way standardized. A ‘question bank’ addresses this organization problem and, for similar reasons, many data archives have established ‘question banks’.

The terms the SQB PDFs proposed for HASSET included survey methods terms. These, and other non-thesaurus terms inherited from the original Qb questionnaires processed pre-2007 by the Qb team at the University of Surrey, became ‘stop words’. PDF is the format in which the documentation is deposited and, as mentioned above, the PDF ‘keyword’ field limits the number of terms that can be entered, which artificially reduced the number of keywords chosen. The selected keywords need to be comparatively specific in this corpus to allow retrieval both of the document itself and of particular questions, together with information about innovative survey methodology.

The documentation includes questionnaires, interview instructions, letters, consent forms, interviewer observations, nurse schedules, technical reports and user guides. For the larger surveys a number of questionnaires may have been administered, each of which is indexed. In these cases additional, more specific terms may have been required than were in the study’s list of keywords. In our example of the British Gambling Prevalence Survey, 2010, however, there was only one questionnaire, so most of the original gold-standard keywords provided by the catalogue record could be used for the SQB PDF document. In the ESDS/UK Data Service Catalogue Record Corpus 14 top terms were used for the study, while the SQB PDF corpus used only 2 top terms, GAMBLING (the topic of the survey) and EMOTIONAL STATES (a very small hierarchy), reflecting the need for the catalogue record keywords to be broader in scope than the keywords required for particular items of survey documentation.

Nesstar Bank of Variables/Questions Corpus
The final corpus is a database of 26,753 indexed variables from 35 research datasets available in the Economic and Social Data Service (ESDS)/UK Data Service Nesstar catalogue. Nesstar is a free online data analysis tool.

For the SKOS-HASSET project these 26,753 variables were indexed between mid-August and the end of October 2012. The variables were associated with questions, parts of questions or marked sections of the administered questionnaire. Nesstar indexing was undertaken to locate variables, whereas SQB indexing was undertaken to locate questions.

While variables appear similar to questions, a single question can be translated into multiple variables (for example, a question with five possible, exclusive multiple-choice answers can result in five variables). Survey datasets are made up of the codes corresponding to variables, encoded by a Blaise program, manually, or by a combination of both, taking account of responses to survey questions. The codes may be binary or a more complex array. Where the answer to a question is a ‘yes’/‘no’ response, a variable and a question can be the same entity of text. However, where the question requires an open numerical response (for example, how many hours it has taken to complete a task), a variable cannot be meaningfully indexed. In addition, some variables are derived. Though present in the data, derived variables did not appear in the Nesstar Bank of Variables/Questions Corpus, which avoided a difficulty that would have been as challenging for manual indexing as for Kea.
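
The sketch below illustrates this expansion of one question into several indicator variables; the question text, answer categories and naming convention are invented for the example.

    def expand_question(question_id, question_text, categories):
        """Return one indicator variable per answer category of a question."""
        variables = []
        for i, category in enumerate(categories, start=1):
            variables.append({
                "variable_name": f"{question_id}_{i}",
                "label": f"{question_text} :: {category}",
                "codes": {0: "not selected", 1: "selected"},
            })
        return variables

    for v in expand_question("q12", "Which of the following do you receive?",
                             ["Housing benefit", "Child benefit", "Pension credit",
                              "Jobseeker's allowance", "None of these"]):
        print(v["variable_name"], "-", v["label"])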

Though Nesstar does group variables in a logical arrangement, when indexing the more complicated Blaise-structured questionnaires it is not easy to attach follow-up questions to the source question without referring back to the questionnaire, which is a time-consuming process. Individual questions, taken out of the context of the schedule that delivered them, cannot be hand-indexed easily or with assured accuracy. In addition, a variable may be constructed via a standard measure such as a copyrighted scale. The component parts of the scale are questions, but whether these comprise one or ten variables will be a coding decision and, for important questionnaire-design reasons, the questions that comprise the scale are not always asked in sequence.

Addressing these problems when manually indexing the Nesstar Corpus meant that much more time was required to index by variable than to index a questionnaire. Indexing by variable led to much repetition of keywords and many keywords per variable to achieve specificity, and not all individual variables were suitable for indexing. The task was made especially difficult by the fact that in the majority of cases Blaise schedules were used for data collection, so the sequence in which the questions are asked is not easy to follow. Although efforts were made not to repeat the indexing process when a question was encountered again, and variables are entered twice when they are found to belong to two or more groups, this was not always possible, and the labour required to index variables individually was multiplied accordingly. These issues increased the work required to undertake the indexing from an estimated 27 person days to 34 person days (FTE), inclusive of planning and testing evaluation forms; that is, from an estimated 1,000 files per day down to 800 files per day.
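
As a quick check of these throughput figures (assuming each of the 26,753 variables counts as one ‘file’):

    import math

    variables = 26753
    print(math.ceil(variables / 1000))  # about 27 person days at the original estimate
    print(math.ceil(variables / 800))   # about 34 person days at the achieved rate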

An example is given below, with reference to study 5294, to illustrate that indexing at variable level required extra, more specific keywords than appear in the catalogue record list.

4.2 Associative terms

Lancaster (1998: 30) says that, beyond the principle of specificity, no real rules of indexing have been developed, only theories. Nevertheless, he proposes two rules which, when followed, can lead to the use of “associative terms”. They are:

  • Include all the topics known to be of interest to the users of the information service that are treated substantively in the document
  • Index each of these as specifically as the vocabulary of the system allows and the needs or interests of the users warrant (Lancaster, 1998: 31)

“Associative terms” should be distinguished from “related terms”. “Related terms” are terms within the same hierarchy; in general indexing practice, using them together is “redundancy” or “multiple indexing” and is not recommended. In practice, however, some “multiple indexing” may be required when indexing variables; in evaluation we will look at the context before assigning this category. “Associative terms”, on the other hand, are terms that are not related in the hierarchical structures of a thesaurus (Lancaster, 1998: 15; Broughton, 2006: 129). They are terms used in combination to cover a concept, and may be just as necessary when indexing a variable or question as they are in larger textual contexts. A common example in the ESDS Data Catalogue Records Corpus is the combination of the terms CHILDREN + HEALTH to cover the concept of CHILD HEALTH.
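
A minimal sketch of this fall-back from a single concept to a combination of associative terms is given below; the tiny in-memory vocabulary and the CHILD HEALTH mapping are illustrative, with only the CHILDREN + HEALTH example taken from the text.

    thesaurus_terms = {"CHILDREN", "HEALTH", "GAMBLING"}          # illustrative subset
    associative_combinations = {"CHILD HEALTH": {"CHILDREN", "HEALTH"}}

    def index_concept(concept):
        """Return the keyword(s) to assign for a concept: the term itself if the
        thesaurus has it, otherwise an agreed combination of associative terms."""
        if concept in thesaurus_terms:
            return {concept}
        return associative_combinations.get(concept, set())

    print(index_concept("GAMBLING"))      # {'GAMBLING'}
    print(index_concept("CHILD HEALTH"))  # {'CHILDREN', 'HEALTH'}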

The use of “associative terms” is generally not easy to see in a long list of keywords, but is very apparent when the text is as short as it is in the variable files of the Nesstar Corpus. Their use depends both on how complex the concept is and on the availability of suitable terms within the thesaurus. There will be a lead term followed by “qualifiers”, each of which is dependent on the previous term (Lancaster, 1998: 56). These “qualifiers” are “context dependent”.

An example in the Nesstar Bank of Variables/Questions Corpus is a variable from study number 5294, Workplace Employee Relations Survey, 2004: Cross-Section Survey, 2004 And Panel Survey, 1998-2004; Wave 2. The question is:

“For each of the above groups of employees, how many are in each of the following occupational groups? Protective and personal services: Full-time females”.

In the questionnaire, ‘personal services’ are defined as ‘caring, leisure and other personal service occupations’. The HASSET term SERVICE INDUSTRIES covers ‘leisure and other personal services as well as protective services’; it does not cover ‘caring’. The HASSET term CARE comes with an instruction to choose a more specific term, while CAREGIVERS refers to “non-professionals”; therefore the allocated keywords were kept at the level of specificity illustrated in Figure 3.

Fig 3. Associative HASSET terms to describe a variable in the Workplace Employee Relations Survey, 2004: Cross-Section Survey, 2004 And Panel Survey, 1998-2004; Wave 2 dataset.

In this ‘context-dependent’ combination of ‘associative terms’, although SERVICE INDUSTRIES refers to OCCUPATIONS, the variable is one in a series of questions about ‘occupations’, so it is important to include this term to allow for analysis at the level of OCCUPATIONS alone. In the catalogue record the list of keywords includes OCCUPATIONS, WOMEN and FULL-TIME EMPLOYMENT; it does not include SERVICE INDUSTRIES. Thus indexing for the Nesstar Bank of Variables/Questions Corpus requires a greater level of specificity in the selection of keywords than was required for indexing the ESDS Catalogue Record Corpus.

In this example the question is one of a series on “occupations”; other connected questions will not necessarily follow in the same sequence. The problem is that of how a concept may be operationalized, as discussed above. Indexing entirely at variable level, which the Nesstar Corpus required, will not necessarily group the variables in a way that makes it easy for the indexer to translate the concept’s operationalized definition back into a general-language definition. Principles of question design make it mandatory that all respondents attribute the same meaning to the question/variable, but that meaning need not be the measured concept, which indexing aims to capture. For the concept of ETHNICITY, discussed above, RELIGION is an indicator. Religion questions are harmonized to ensure that the meaning is not in any way ambiguous and that the measurement is standard; however, the respondent may not know that data are being collected on ETHNICITY. If, however, the measurement is a subjective one, that is, if the respondent is asked which ethnic group they belong to, then the straightforward HASSET term ETHNIC GROUPS will apply (see the ONS guide Ethnic Group Statistics: A Guide for the Collection and Classification of Ethnicity Data).

However, it should be kept in mind that these associations rely on the availability within the thesaurus of suitable terms that follow the rules of word combination. There are word combinations that should not be split if meaning is to be preserved (de Keyser, 2012: 20).

5. Conclusion

The level of specificity of gold-standard indexing for our four corpora reflected the type of information object a user may wish to retrieve. The Case Studies and Support Guides Corpus is a relatively small collection of documents and requires a few broad terms to ensure retrieval. The other corpora require a greater degree of specificity in the keywords chosen, as well as the use of associative terms, which may or may not be visible depending on the size of the text file.

Nesstar indexing was undertaken to locate variables, which in the majority of cases match questions or parts of questions and sometimes sections of the questionnaire; these variables are complex in design and difficult to process individually. SQB indexing, on the other hand, was undertaken to locate questions. Both corpora require a considerable level of specificity in the selection of keywords to cover concepts of some complexity, which will at times exhaust the terms available within a thesaurus hierarchy and perhaps require it to be extended.

We expect to release the evaluation results in the next few weeks.

References

Broughton, V. (2004) Controlled Indexing Languages. In Essential Classification. London: Facet.

Broughton, V. (2004) Faceted Classification. In Essential Classification. London: Facet.

Broughton, V. (2006) Essential Thesaurus Construction. London: Facet.

Bulmer, M. (ed.) (2010) Social Measurement through Social Surveys: An Applied Approach. Farnham: Ashgate.

De Keyser, P. (2012) Indexing: From Thesauri to the Semantic Web. Oxford: Chandos.

Hyman, L., Lamb, J. and Bulmer, M. (2006) The Use of Pre-Existing Survey Questions: Implications for Data Quality. Proceedings of the European Conference on Quality in Survey Statistics, Rome. Retrieved on 05/01/2013.

Kneeshaw, J. (2011) The UK’s Survey Question Bank: Present and Future Developments. Question Database Workshop, Reseau Quetelet, Paris.

Lancaster, F. W. (1998) Quality of Indexing. In Indexing and Abstracting in Theory and Practice, 2nd ed. Champaign, Illinois: University of Illinois.

Office for National Statistics (2003) Ethnic Group Statistics: A Guide for the Collection and Classification of Ethnicity Data. Retrieved from http://www.ons.gov.uk on 05/01/2013.

Posted in Evaluation, Indexing | Leave a comment

2012 review, and a look forward to 2013

As the year draws to an end, it feels right and proper to reflect on what we have achieved so far in the SKOS-HASSET project.

The project began in June and, since then, it has done the following:

We have more work to do, which will take us into the Spring of 2013.  Forthcoming tasks include:

  • Finishing the manual evaluation of the automatic indexing exemplar
  • Writing the automatic indexing recommendations report
  • Applying version control to the thesauri
  • Releasing a new version of the thesauri
  • Releasing SKOS-HASSET on the web, via Pubby, using the recommendations of the licensing report
  • Running a webinar on the use of the thesaurus
  • Creating a leaflet to encourage wider use of the thesaurus

We are looking forward to working more with SKOS and HASSET in the New Year.  In the meantime, the SKOS-HASSET project team would like to wish all its blog readers a very Happy Christmas!

Lucy Bell, UK Data Archive

Posted in Project Management | Leave a comment