SKOS-HASSET Evaluation Plan

Lorna Balkan and Mahmoud El-Haj

1. Aim of the text-mining task

This blog post describes our evaluation plan. The aim of the SKOS-HASSET text mining task is to investigate the potential of SKOS-HASSET by applying it automatically, using a variety of techniques and tools, to selected contents of the UK Data Archive’s collections and comparing the results against:

(1)    The manually-indexed keyword gold-standards.

(2)    Other organisations’ use of HASSET for automatic indexing, where comparisons can be made and the third parties are willing.

Testers:  Archive indexers and appropriate stakeholders.

General questions we want to answer:

  • How well do the text mining tools we use compare to human indexers?
  • Which text mining tools perform best?
  • Which test collection do the tools work best on (and why)?
  • How does the best automatic indexing tool perform on Archive texts compared to the third party tools on their texts? (Qualitative/opinion-based judgement.)


The full Humanities and Social Science Electronic Thesaurus (HASSET) consists of 12,233 hierarchically arranged terms, including 7,634 descriptors or preferred terms (which are used for indexing); the rest are synonyms or non-preferred terms. Geographical terms are excluded from the text mining exercise, leaving a total of 8,830 terms (preferred and non-preferred).

Anticipated problems and challenges:

  • There are a large number of closely related terms in full HASSET (the machine learning algorithm may struggle to discriminate between them).
  • Difference in level of abstraction; some terms (e.g. STUDENT SOCIOLOGY) are very abstract and are unlikely to appear verbatim in texts, whereas others (e.g. MENTAL HEALTH) are much less abstract and more widely used.
  • Difference in size and format; terms can be single words or multi-word units, and may contain qualifiers (e.g. ADVOCACY (LEGAL)) which may not appear verbatim in texts.
  • Synonymy: different words with identical or very similar meanings. Thesauri control synonymy by choosing one word or term as the preferred term and making its synonyms non-preferred terms. Mapping synonyms to their preferred terms is a challenge for automatic indexers.
  • Polysemy: the coexistence of many possible meanings for a word or phrase. Like other thesauri, HASSET controls polysemy by restricting the meaning of its terms to avoid ambiguity. For example, the meaning of the term “COURTS” in HASSET is restricted to mean “LAW COURTS”, given its position in the hierarchy “ADMINISTRATION OF JUSTICE”.  Polysemy is a challenge for automatic indexing.  Supervised machine learning algorithms with features for disambiguation have been successful in tackling this problem (see Dash 2002, 2005, 2008).
  • Some HASSET terms are used in a very particular sense in the thesaurus (e.g. “SCHOOL-LEAVING” versus “SECONDARY SCHOOL LEAVING”). The scope note for “SCHOOL-LEAVING” says: “USE FOR LEAVING SCHOOL UP TO COMPLETION OF COMPULSORY EDUCATION. FOR SCHOOL LEAVING UPON COMPLETION OF COMPULSORY EDUCATION, USE THE TERM SECONDARY SCHOOL LEAVING”. Again, this presents a challenge for machines.
  • Plural form: the convention in HASSET is to use the plural form of count nouns (e.g. TOWNS, not TOWN), while both singular and plural forms are found in texts.
  • Spelling variants: many words can have different spellings. The use of -ization versus -isation is an example, as is the use of hyphenation. HASSET terms ending in -ization should match words ending in both -isation and -ization.
  • Some HASSET terms are used chiefly as placeholders in the thesaurus, with a scope note to say use a more specific term instead (e.g. “RESOURCES”, which has the following scope note: “AVAILABLE MEANS OR ASSETS, INCLUDING SOURCES OF ASSISTANCE, SUPPLY, OR SUPPORT (NOTE: USE A MORE SPECIFIC TERM IF POSSIBLE). (ERIC)”). These terms may therefore be assigned more often by the automatic indexer than by the human indexer.
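
The plural-form, qualifier and spelling-variant challenges listed above can be illustrated with a small normalisation step applied to both thesaurus terms and text before matching. This is only a minimal sketch, not the project's method: the helper name `normalise` and its crude suffix rules are assumptions made for illustration, and a real stemmer or lemmatiser would handle many more cases.

```python
import re


def normalise(term: str) -> str:
    """Crudely normalise a term so that common variants compare equal.

    Hypothetical helper (not part of HASSET or any tool named in this post):
    lower-cases, drops parenthetical qualifiers, treats hyphens as spaces,
    strips a plural 's' and maps -isation to -ization.
    """
    term = term.lower()
    term = re.sub(r"\s*\([^)]*\)", "", term)  # ADVOCACY (LEGAL) -> advocacy
    term = term.replace("-", " ")             # school-leaving -> school leaving
    words = []
    for word in term.split():
        # Strip a trailing plural 's', leaving short words and '-ss' alone.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        # Map the British -isation spelling onto -ization.
        if word.endswith("isation"):
            word = word[: -len("isation")] + "ization"
        words.append(word)
    return " ".join(words)
```

Because the same function is applied to the thesaurus term and to the text, over-aggressive rules (for example turning "analysis" into "analysi") still produce matching keys, although they may also conflate words that should stay distinct.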

3. Corpora

We are currently in the process of preparing and describing the corpora we intend to use for the automatic indexing process. The material includes:

  1. The bank of variables/questions (individual variables indexed, each with HASSET terms specific to themselves).
  2. Survey Question Bank (SQB) questionnaires.
  3. ESDS data catalogue records.
    a. abstracts (from all catalogue records).
    b. full catalogue records (from Study Number 5000 onwards: these are the most recent catalogue records, dating from 2005).
  4. Other full-text documents.
    a. case studies.
    b. support guides.

The first corpus (bank of questions) is currently being indexed manually. The fourth corpus (case studies and user guides) has been indexed using UK Data Archive subject categories, which need to be mapped to HASSET terms. This work is also ongoing. Corpus 3 (catalogue records and documentation) contains HASSET index terms derived from the data and documentation. Corpus 3a (abstracts) has not been indexed separately from the rest of the data and documentation. The aim here is to see how the terms that the automatic indexer suggests for these records match up with the manually indexed terms applied in relation to the data. Will it be possible to use documentary evidence and text associated with data to generate effective and useful HASSET terms?

Possible problems and challenges for the text mining task include:

  • The difference in the size of corpora subtypes (and within some subtypes, e.g. case studies).
  • The different number and type of terms assigned to each subtype of corpus (all corpora have been indexed either with the full set of HASSET terms, or with the smaller subset that has been mapped to UK Data Archive subject categories).
  • The large variation in the number of terms assigned within some subtypes of corpus (e.g. catalogue records, where the number of terms assigned ranges from 3 to 468).
  • Some HASSET terms are very commonly used, while others are hardly used; rare terms will be harder to train.
  • Older documents have been indexed with an older version of HASSET.
  • Some older documents contain OCR’ed files, which will be harder for the automatic indexer to process.
  • Most corpora contain a degree of spelling errors.

3.1 Training versus test corpora

For supervised machine learning tasks, each corpus needs to be divided into a training corpus and a test corpus. The automatic indexer is trained on previously indexed material (the training corpus) and then tested on new, unseen test material (the test corpus). Since our corpora are all somewhat different, we have decided to use separate training corpora for each sub-corpus.
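
As a rough illustration of splitting each sub-corpus separately, the sketch below shuffles document identifiers reproducibly and holds out a fraction for testing. The function name `split_corpus`, the 20% test fraction, the seed and the sub-corpus names and sizes are all assumptions made for the example; the post does not specify how the split will actually be done.

```python
import random


def split_corpus(doc_ids, test_fraction=0.2, seed=0):
    """Shuffle document ids reproducibly and return (train_ids, test_ids).

    test_fraction and seed are illustrative defaults, not project settings.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


# One independent split per sub-corpus (hypothetical names and sizes).
corpora = {"abstracts": range(100), "case_studies": range(20)}
splits = {name: split_corpus(ids) for name, ids in corpora.items()}
```

Keeping the test documents disjoint from the training documents is what makes the test material "unseen"; with heavily skewed term frequencies, a stratified split may be worth considering instead of a purely random one.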

4. Evaluation criteria

4.1. Overview

There are different ways of evaluating text mining systems.  For example, we can look at ‘usability’ testing which covers functionality, reliability, usability, efficiency and maintainability of the system, and is concerned with the usefulness of a system to its users (see Ananiadou et al. 2010).

In this project, however, we are concerned only with evaluating accuracy, which ‘only tells us how well the system can perform the tasks that it is designed to perform, without considering methods of user interaction’ (Ananiadou et al. 2010). We are not creating a tool or user interface for applying terms automatically, but rather testing some functionality. Any evaluation of user interaction is out of scope.

In terms of accuracy, we can evaluate either the accuracy of the system in the indexing task, or, alternatively, the accuracy of the system in the information retrieval task.  In this project, we are only interested in accuracy of the system with regard to the indexing task.

The accuracy or effectiveness of a classifier will be measured by the degree of overlap between the automated classification decisions and those originally made by the human indexer (the ‘gold standard’).

A classifier can make either a ‘hard’ classification decision (i.e. a binary decision, where a keyword is assessed as either relevant to a document or not) or a ‘soft’ classification decision (i.e. one where a keyword is assigned a numeric score, e.g. between 0 and 1, that reflects the classifier’s confidence that the keyword is relevant to the document). Hard classification is more appropriate for classifiers that operate without human intervention. Soft classification is more appropriate for systems that rank keywords in terms of their appropriateness to a document, but where a human expert makes the final decision (see Sebastiani 2006).

In our experiments, we assume that the classifier will present a ranked list of candidate terms to the human expert for final decision-making.
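
The relationship between soft scores, a ranked candidate list and a hard decision can be sketched as follows. The function name `rank_candidates`, its parameters and the example scores are hypothetical and are not taken from any of the tools under review.

```python
def rank_candidates(scores, top_n=5, threshold=None):
    """Return (term, score) pairs ordered by descending confidence.

    scores maps candidate terms to soft scores between 0 and 1.
    If threshold is given, the soft scores are turned into a hard decision:
    only terms scoring at least the threshold are kept.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(term, score) for term, score in ranked if score >= threshold]
    return ranked[:top_n]


# Made-up scores for three HASSET terms on one document.
scores = {"MENTAL HEALTH": 0.91, "COURTS": 0.12, "TOWNS": 0.55}
shortlist = rank_candidates(scores, top_n=2)
```

In the scenario assumed in this plan, the human expert would see the shortlist and make the final choice, so no threshold is needed; the threshold only matters if the classifier has to decide on its own.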

We use the following metrics to measure performance:

  • Precision:  the fraction of retrieved instances that are relevant. Precision can be interpreted as the number of keywords judged as correct divided by the total number of keywords assigned.
  • Recall:  the fraction of relevant instances that are retrieved. Recall can be interpreted as the number of correct keywords assigned, divided by all keywords deemed relevant to the object.
  • F1 score (or F-measure, F-score): F1 score considers both recall and precision to measure accuracy. It is the harmonic mean of precision and recall (best value 1, worst 0).
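
For a single document, the three metrics can be computed from the set of keywords assigned by the tool and the set in the gold standard. The sketch below assumes exact string matching between terms, which is a simplification; the function name and the example keywords are illustrative only. F1 is computed here as the harmonic mean of precision and recall.

```python
def precision_recall_f1(assigned, gold):
    """Return (precision, recall, f1) for one document.

    assigned: keywords proposed by the automatic indexer.
    gold: keywords chosen by the human indexer (the gold standard).
    Terms are compared by exact match, which is an assumption of this sketch.
    """
    assigned, gold = set(assigned), set(gold)
    correct = len(assigned & gold)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Four assigned terms, three gold terms, two in common.
p, r, f1 = precision_recall_f1(
    {"TOWNS", "COURTS", "RESOURCES", "MENTAL HEALTH"},
    {"TOWNS", "COURTS", "SCHOOL-LEAVING"},
)
```

In this example precision is 2/4, recall is 2/3 and F1 is 4/7. Corpus-level figures could be obtained either by averaging the per-document scores or by pooling the counts over all documents; the two approaches can give different results.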

4.2 Statistical significance testing

The purpose of statistical significance testing is ‘to help us gather evidence of the extent to which the results returned by an evaluation metric are representative of the general behaviour of our classifiers’ (see Japkowicz 2011). In other words, can the observed results be attributed to real characteristics of the classifiers under scrutiny or are they observed by chance?

The t-test assesses whether the means of two groups are statistically different from each other (see Fisher 1925). This analysis is appropriate whenever you want to compare the means of two groups. In our work on SKOS-HASSET, where appropriate, we determine any significant differences by performing pairwise t-tests (p < 0.05) using the R statistics package. At this threshold, about five times out of a hundred you would find a statistically significant difference between the means even if there were none.
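
To make the statistic concrete, the sketch below computes a paired t statistic for two tools scored on the same documents. The project itself uses R for this; the Python here is only an illustration. Treating the samples as paired is an assumption, and the function name and the F1 scores are invented for the example.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for two paired samples.

    Assumes both lists hold scores for the same documents in the same order.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Invented per-document F1 scores for two hypothetical tools.
tool_a = [0.60, 0.55, 0.70, 0.65, 0.50]
tool_b = [0.50, 0.50, 0.60, 0.60, 0.45]
t, df = paired_t_statistic(tool_a, tool_b)
```

The resulting t is then compared with the critical value of the t distribution for the given degrees of freedom (2.776 for 4 degrees of freedom in a two-sided test at the 0.05 level), or converted to a p-value by a statistics package.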

We are currently working out the exact details of how to perform the evaluation. Questions we are considering include:

  • Who should evaluate the system output against the gold standard? (The original indexer will provide context and evidence for their decisions; a combination of the indexer plus a third party may be best.)
  • How do we judge precision and recall – in other words, how close does a term need to be to the gold standard?

5. Document preparation and representation

Document preparation involves some or all of the following steps:

  • Convert documents to plain text
  • Apply tokenization:  break the stream of text into words, phrases, symbols, or other meaningful elements called tokens.
  • Remove ‘stop words’ (i.e. tokens/keywords that bear no content, such as articles and prepositions, or whose content is not discriminating for the document collection (e.g. ‘data’ in our experiments)).
  • Apply stemming: tokens are reduced to their ‘stem’ or root form. For example “searcher”, “searches”, “searching”, “searched” and “searchable” would all be reduced to “search”. The stems may not be real words, e.g. “computation” might be stemmed to “comput”. A system that converts a word to its linguistically correct root (“compute” in this case) is called a lemmatiser. In most cases, morphological variants of words have similar semantic interpretations and can be considered as equivalent for the purpose of Information Retrieval (IR) applications. Stemming and lemmatising also abstract away from common spelling variants (e.g. -ization/-isation) and reduce the number of distinct terms needed to represent a set of documents, thus saving storage space and processing time.
  • Transform documents into a vector space, whose dimensions correspond to the terms that occur in the training set.
  • Apply a weight to each term: weights are intended to reflect the importance a word has in determining the semantics of the document it occurs in, and are automatically computed by weighting functions.
  • Apply document length normalisation
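
The first few preparation steps above (tokenisation, stop word removal, stemming and a simple term-count vector) can be sketched as follows. The stop word list and the suffix-stripping rules are deliberately tiny and purely illustrative; `crude_stem` is a hypothetical stand-in for a real stemmer, which we have not yet chosen.

```python
import re
from collections import Counter

# Illustrative stop word list only; 'data' is included because it is not
# discriminating for this document collection.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "data"}


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("able", "ing", "ed", "er", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenise, drop stop words and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


def term_vector(text):
    """Represent a document as raw counts of its stemmed terms."""
    return Counter(preprocess(text))
```

For instance, `term_vector("The searcher searches the data")` counts the stem "search" twice and nothing else, since "the" and "data" are removed as stop words. Weighting and length normalisation would then be applied to these raw counts.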

Document length normalisation

As we are applying term weighting (term importance in a document), we need to consider each document’s length. Long documents usually use the same terms repeatedly, so their term frequencies, and hence their term weights, tend to be higher. Furthermore, long documents contain many different terms, which increases the number of matches between a query and a long document (see Singhal et al. 1996). Document length normalisation overcomes these problems by penalising the term weights of a document in accordance with its length. In our project, document length normalisation will be applied to ensure fair precision, recall and F-measure scores between short and long documents.

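One common normalisation technique is cosine normalisation, which divides each term weight in a document by the factor

$$\sqrt{W_1^2 + W_2^2 + \dots + W_t^2}$$

where $W_i$ is the weight of the i-th term and t is the number of distinct terms in the document. Cosine normalisation addresses both problems (higher term frequencies and more terms) in one step.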
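
A minimal sketch of cosine normalisation, assuming each document is held as a dictionary of term weights: every weight is divided by the Euclidean length of the document's weight vector, i.e. the square root of the sum of the squared weights. The function name and the example weights are illustrative.

```python
import math


def cosine_normalise(weights):
    """Divide each term weight by the Euclidean length of the weight vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)  # empty or all-zero document: nothing to scale
    return {term: w / norm for term, w in weights.items()}


# A short document and a longer one that repeats the same terms twice as often.
short_doc = cosine_normalise({"health": 3, "survey": 4})
long_doc = cosine_normalise({"health": 6, "survey": 8})
```

After normalisation both documents have the same weights (0.6 and 0.8), so the longer document no longer gains an advantage simply from repeating its terms.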

We are currently reviewing different tools and methods for each of the above steps.

6. Text mining Tools/techniques

We are currently reviewing different tools and techniques for performing automatic indexing. These include term frequency–inverse document frequency (TF-IDF) and Kea.
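
As an illustration of the first of these techniques, the sketch below computes one common TF-IDF variant: raw term count multiplied by the natural logarithm of (number of documents / number of documents containing the term). Many other variants exist, and this is not necessarily the one we will adopt; the function name and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Weight terms by raw count times log(N / document frequency).

    documents: a list of token lists, one per document.
    Returns one {term: weight} dictionary per document.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weighted


docs = [["health", "survey", "health"], ["survey", "school"]]
weights = tf_idf(docs)
```

Here "survey" occurs in every document and so receives a weight of zero, while "health", which is frequent in one document and absent from the other, receives the highest weight. High-weight terms are the candidates that would then be mapped to HASSET.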

7. Experiments and evaluation methodology

Experiments with supervised learning techniques involve a number of steps, including:

  1. Pre-process text.
  2. Extract terms.
  3. Map terms to HASSET.
  4. Compare results with gold standard.
  5. Tune parameters to maximise precision and recall.
  6. Compare the tuned results with the gold standard.

8. Presenting the results

We are currently reviewing the literature for ways in which to present our results. Hliaoutakis (2009), for example, describes a comparative study of three systems using precision and recall. Steinberger et al. (2012) report correlations between, amongst other things, precision, recall and F1 with the number of stopwords used, document collection size and number of keywords in the thesaurus.

As soon as we have some results to share, another blog post will follow.


Ananiadou, S., Thompson, P., Thomas, J., Mu, T., Oliver, S., Rickinson, M., Sasaki, Y., Weissenbacher, D. and McNaught, J. (2010) ‘Supporting the education evidence portal via text mining’, Philos Trans R Soc, 368, pp.3829-3844.

Dash, N.S. (2002) ‘Lexical polysemy in Bengali: a corpus-based study’, PILC Journal of Dravidic Studies, 12(1-2), pp.203-214.

Dash, N.S. (2005) ‘The role of context in sense variation: introducing corpus linguistics in Indian contexts’, Language In India, 5(6), pp.12-32.

Dash, N.S. (2008) ‘Context and contextual word meaning’, Journal of Theoretical Linguistics, 5(2), pp.21-31.

Dryad HIVE Evaluation,

Fisher, R.A. (1925) Statistical Methods for Research Workers, 1st ed., Edinburgh: Oliver & Boyd.

Funk, M.E., Reid, C.A. and McGoogan, L.S. (1983) ‘Indexing consistency in MEDLINE’, Bull Med Libr Assoc, 71(2), pp.176–183.

Hliaoutakis, A. (2009), Automatic term indexing in medical text corpora and its application to consumer health information systems, Master’s thesis, Department of Electronic and Computer Engineering, Technical University of Crete, Greece, December 2009.

Jacquemin, C. and Daille, B. (2002) ‘In Vitro Evaluation of a Program for Machine-Aided Indexing’, Information Processing & Management, 38, Issue 6, November 2002, pp. 765–792.

Japkowicz, N. (2011) Performance evaluation for learning algorithms, online tutorial.

Névéol, A., Zeng, K. and Bodenreider, O. (2006) ‘Besides Precision & Recall: Exploring Alternative Approaches to Evaluating an Automatic Indexing Tool for MEDLINE’, AMIA Annual Symposium Proceedings 2006, pp.589–593.

Pouliquen, B., Steinberger, R. and Ignat, C. (2003) ‘Automatic annotation of multilingual text collections with a conceptual thesaurus’, Proceedings of the Workshop Ontologies and Information Extraction at EUROLAN 2003,

Sebastiani, F. (1999) ‘A tutorial on automated text categorisation’, in A. Amandi and A. Zunino (eds.), Proceedings of the 1st Argentinian Symposium on Artificial Intelligence (ASAI 1999), Buenos Aires.

Sebastiani, F. (2006) ‘Classification of text, automatic’, in K. Brown (ed.) (2006) Encyclopedia of Language and Linguistics, 2nd ed., 2, pp.457-462, Oxford: Elsevier.

Singhal, A., Buckley, C. and Mitra, M. (1996) ‘Pivoted document length normalization’ In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR ’96), ACM, New York, NY, USA, pp.21-29.

Steinberger, R., Ebrahim, M. and Turchi, M. (2012) ‘JRC EuroVoc Indexer JEX – A freely available multi-label categorisation tool’, LREC conference proceedings 2012.

Van Rijsbergen, C.J. (1979) Information Retrieval, 2nd ed., Newton, MA: Butterworth-Heinemann.
