The gold standard indexes we used for SKOS-HASSET training and evaluation were a combination of our in-house quality-controlled indexes and a specially prepared index used for training Kea to index variables. The latter task was performed as exhaustively as possible to enhance training but could not ignore efficiency constraints. All indexes conform to ISO standards.
In-house indexing takes account of:
- the perceived needs of the users
- policy issues incorporating plans for the future development of the collection
Indexing data and indexing documents are slightly different processes:
- Concepts within data are measurements and their definitions may vary
- Concepts within data-related documentation have a general language definition
Indexing the former type of concept is a two-stage process: its operationalized definition is translated into a general language definition, and from that into a thesaurus definition. Indexing the latter is simpler, requiring only some form of mapping between a general language definition and the thesaurus definition. This blog post explains these two processes in greater detail and outlines how the indexes were prepared with reference to the functions each corpus was designed to perform.
2. Information retrieval requirements
Funded by ESRC, the JISC and the University of Essex, the UK Data Archive is committed to supporting secondary analysis; that is the re-use of quantitative, qualitative or historical data. Data analysis requires knowledge of and access to specialist statistical software such as SPSS or Stata; thus, the Archive’s users are a special clientele who rely on the Archive both to supply data in an appropriate form and to support their use.
Since Archive users are primarily data analysts, an important access requirement is variable information. Some users will be specialists in a particular data collection; others will have broader interests and require subject rather than variable access. Data analysts will wish to retrieve particular variables of interest or to locate studies of interest; survey managers, teachers of survey methods, social and economic researchers and postgraduate students wishing to conduct pre-analysis will search for relevant survey questions and other information in the documentation.
To meet these needs the Archive has enriched its resources through the provision of the Nesstar service, the Survey Question Bank project and by producing Case Studies and Support Guides. These resources can be represented thus:
Fig 1. SKOS-HASSET Test Corpora
3. The Corpora
We used four corpora for testing Kea for the SKOS-HASSET project. They comprised:
ESDS Data Catalogue Records with existing study level keywords, plus three supplementary corpora:
- Survey Question Bank PDFs with existing keywords providing access to documentation the depositor supplies with the dataset
- Case Studies with existing subject indexes providing examples of how Archive studies have been used in teaching and research and Support Guides providing an overview of the data collections and internal procedures
- Nesstar Bank of Variables/Questions including only research datasets with a tailor-made index
These supplementary resources have evolved to provide public access to ‘data-related’ documentation and to view frequencies and make cross tabulations using the Nesstar service.
Procedures for cataloguing and indexing have evolved alongside the development of the Archive’s thesaurus HASSET. Initially HASSET was based on the UNESCO Thesaurus (1977). Now that HASSET is established as a general social science thesaurus, indexing is directed towards providing the best access to variables within the data. To maintain a high standard of information retrieval, no limit is imposed on the number of index terms permitted per data collection. Cataloguers select as many new terms as they find necessary, which may vary between 10 and 450 terms. Data indexing tends to incorporate both broad and narrow terms to cover the variation of concept levels that variables may include.
Data are a complex object in information retrieval terms. When concepts are operationalized, the indicators specified will rarely be found in the same thesaurus hierarchy, nor will they necessarily have a lexical relationship. For example, the Office for National Statistics webpage Different Aspects of Ethnicity recommends that the variables ‘Country of birth’, ‘Nationality’, ‘Language spoken at home’, ‘Skin colour’, ‘National/geographical origin’, and ‘Religion’ should together indicate the concept of ‘Ethnicity’. At the same time, indicators such as ‘Religion’ may also contribute to the operationalization of other concepts. Providing this information, together with information on data collection methods, data preparation and results or findings, is best practice for data depositors and is key to enabling the secondary user to make informed use of the data. This information is incorporated in the catalogue record abstract and subject categories (from the DDI element <topcClas>), which were used for Kea.
All four corpora then reference data collections in the data catalogue, indexed with HASSET terms. Since each corpus was arranged to meet specific user needs, different strategies of indexing were necessary for each corpus and thus training for Kea was done separately.
4. Manual Indexing Practices
Two principles of indexing are worth discussing in the context of providing a gold standard for Kea. They are:
- The principle of specificity
- Associative terms
4.1 The Principle of Specificity
The central principle of indexing is to use the most specific term that entirely covers the topic (Lancaster, 1998: 28; de Keyser, 2012: 11; Broughton, 2004: 70). The controlled vocabulary or thesaurus may not include a term at the level of specificity required by a particular resource, in which case either a higher level term is used or a more specific term needs to be added to the thesaurus (Lancaster, 1998: 30).
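The fallback described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `most_specific_available` is hypothetical, the narrower term SLOT MACHINES and its broader-term link are invented for the example, and nothing here reflects the real HASSET hierarchy or how Kea selects terms.

```python
# Toy broader-term links, invented for illustration only.
# They are NOT taken from the real HASSET hierarchy.
BROADER = {
    "SLOT MACHINES": "GAMBLING",  # hypothetical narrower term
    "GAMBLING": None,             # treated here as a top term
}


def most_specific_available(candidate, vocabulary, broader=BROADER):
    """Return the most specific term in `vocabulary` that covers `candidate`.

    Walks up the broader-term chain from the candidate. Returns None when
    no term is available, which signals that the thesaurus may need a new,
    more specific term.
    """
    term = candidate
    while term is not None:
        if term in vocabulary:
            return term
        term = broader.get(term)
    return None


# The specific term is missing from the vocabulary, so the indexer
# falls back to the higher level term.
print(most_specific_available("SLOT MACHINES", {"GAMBLING"}))
```

If the vocabulary did contain the narrower term, the same call would return it unchanged, which is the preferred outcome under the principle of specificity.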
Since the topics of the Case Studies and Support Guides are extremely broad, while a Nesstar Bank of Variables/Questions index requires very specific index terms, the level of specificity of indexing required for each corpus can be represented as below:
Fig 2. Level of Specificity of Indexing required by the content of the SKOS-HASSET Corpora
Case Studies and Support Guide Corpus
Case Studies and Support Guides have been collected together under a broad topic classification scheme and tagged with UK Data Archive Subject Categories. These were mapped to HASSET terms for the SKOS-HASSET project and became our gold standard. The case study “Unemployment and Psychological Well-Being” for example is classified under the topics LABOUR AND EMPLOYMENT + ECONOMICS, both of which are HASSET top terms for very large hierarchies.
Economic and Social Data Service (ESDS)/UK Data Service Catalogue Corpus
The main resource for information retrieval managed by the UK Data Archive is the ESDS (and soon to be UK Data Service) catalogue of studies. This is a union catalogue shared by members of the Service and includes records of a variety of data collections for quantitative analysis, including census data; data suitable for qualitative analysis; and historical research resources.
The aim of indexing a data collection is to retrieve a particular study as well as providing access to the variables available for analysis within the particular study.
While a cardinal rule of indexing is not to employ ‘multiple indexing’ (de Keyser, 2012: 12), that is to introduce redundant terms (Lancaster, 1998: 280), a dataset is not one discrete object of information in the same way a document is. Within a data collection the objects to be retrieved are variables or questions as well as documents or support guides, and all need to be accessed together with their associated documents. It is therefore often necessary to use ‘multiple indexing’, or to introduce what may appear to be ‘redundant’ terms. This explains why a catalogue record can legitimately include a high number of top terms together with specific terms, sometimes from the same hierarchy.
An example in this corpus, which also by chance was included in the Survey Question Bank Corpus, was Study 6843: British Gambling Prevalence Survey, 2010. Although it had been indexed with many top terms the British Gambling Prevalence Survey, 2010 had a total of 39 lower level terms among its 53 HASSET keywords in the ESDS Data Catalogue Records.
Survey Question Bank (SQB) PDF Corpus
The third corpus is the Survey Question Bank PDF collection of documents. In 2007 the Archive inherited an organized collection of survey documentation and enriched associated materials in PDF format from the University of Surrey (Qb). Survey documentation is processed at the point of ingest and access is provided on the catalogue metadata page; however, the format in which these materials arrive at the Archive is in no way standardized. A ‘question bank’ addresses this organization problem and, for similar reasons, many data archives have established ‘question banks’.
The SQB PDF proposed terms for HASSET included survey methods terms. These and other non-thesaurus terms that the Archive inherited from the original Qb questionnaires, processed pre-2007 by the Qb team at the University of Surrey, became ‘stop words’. PDF is the format in which the documentation is deposited and, as mentioned above, the PDF ‘keyword field’ limits the number of terms that can be entered, which artificially reduced the number of keywords chosen. The selected keywords need to be comparatively specific in this corpus to allow the retrieval of both the document itself and particular questions, together with information about innovative survey methodology.
The documentation includes questionnaires, interview instructions, letters, consent forms, interviewer observations, nurse schedules, technical reports and user guides. For the larger surveys a number of questionnaires may have been administered, each of which is indexed. In these cases more specific additional terms may have been required than were in the study list of keywords. In our example of the British Gambling Prevalence Survey, 2010, however, there was only one questionnaire, so most of the original gold standard keywords provided by the catalogue record could be used for the SQB PDF document. In the ESDS/UK Data Service Catalogue Record Corpus 14 top terms were used for the study British Gambling Prevalence Survey, 2010, while the SQB PDF Corpus used only 2 top terms, GAMBLING (the topic of the survey) and EMOTIONAL STATES (a very small hierarchy), reflecting the need for the catalogue record keywords to be broader in scope than the keywords required for particular items of survey documentation.
Nesstar Bank of Variables/Questions Corpus
The final corpus is a database of 26753 indexed variables from 35 research datasets available in the Economic and Social Data Service (ESDS)/UK Data Service Nesstar catalogue. Nesstar is a free online data analysis tool.
For the SKOS-HASSET project 26753 variables were indexed in the period between mid-August and the end of October 2012. These variables were associated with questions, parts of questions or marked sections of the administered questionnaire. Nesstar indexing was undertaken to locate variables, whereas SQB indexing was undertaken to locate questions.
While variables appear similar to questions, a single question can be translated into multiple variables (for example a question with five possible, exclusive multiple choice answers can result in five variables). Survey datasets are made up of the codes corresponding to variables, encoded by a Blaise program, manually, or by a combination of both, taking account of responses to survey questions. The codes may be binary or a more complex array. Where the answer to a question is a ‘yes’/‘no’ response, a variable and question can be the same entity of text. However, where the question requires an open numerical response (for example how many hours it has taken to complete a task), a variable cannot be meaningfully indexed. In addition some variables are derived. Though present in the data, derived variables did not appear in the Nesstar Bank of Variables/Questions Corpus, which avoided a difficulty as challenging to manual indexing as it would be for Kea.
Though Nesstar does group variables in a logical arrangement, when indexing the more complicated Blaise-structured questionnaires it is not easy to attach follow-up questions to the source question without referring back to the questionnaire, which is a time-consuming process. Individual questions taken out of the context of the schedule that delivered them cannot be hand indexed with ease, nor with assured accuracy. In addition a variable may be constructed via a standard measure such as a copyrighted scale. The component parts of the scale are questions, but whether these comprise one or ten variables is a coding decision and, for important questionnaire design reasons, the questions that comprise the scale are not always asked in sequence.
Addressing these problems in the task of manually indexing the Nesstar Corpus meant much more time was required to perform variable indexing than indexing a questionnaire. Indexing by variable led to much repetition of keywords and many keywords per variable to achieve specificity. Not all individual variables were suitable for indexing. The task was made especially difficult by the fact that in the majority of cases Blaise schedules were used for data collection and thus the sequence in which the question is asked is not easy to follow. Although efforts were made not to repeat the indexing process when a question was encountered again, and variables are entered twice when they are found to belong to two or more groups, this was not always possible. As a result, the labour required to index variables individually was multiplied in a similar fashion. These issues increased the work required to undertake the indexing from an estimated 27 person days to 34 person days (fte) work, inclusive of planning and testing evaluation forms; that is from an estimated 1000 files per day down to 800 files per day.
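The rounded throughput figures quoted above can be checked with a short calculation. This is a rough sketch that assumes one file per indexed variable (26753 in total); because the person days also cover planning and testing the evaluation forms, the rates are indicative only.

```python
# Figures taken from the text above.
VARIABLES = 26753      # variables indexed, assumed to be one file each
ESTIMATED_DAYS = 27    # original estimate, in person days
ACTUAL_DAYS = 34       # person days actually required


def files_per_day(total_files, person_days):
    """Average number of files indexed per person day."""
    return total_files / person_days


estimated_rate = files_per_day(VARIABLES, ESTIMATED_DAYS)
actual_rate = files_per_day(VARIABLES, ACTUAL_DAYS)

# Roughly 1000 and 800 files per day once rounded to the nearest hundred.
print(round(estimated_rate), round(actual_rate))  # prints "991 787"
```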
An example will be given below with reference to study 5294 to illustrate that indexing at variable level required extra, more specific keywords than those in the catalogue record list.
4.2 Associative terms
Lancaster (1998: 30) says beyond the principle of specificity no real rules of indexing have been developed, only theories. Nevertheless he proposes two rules which, when followed, can lead to the use of “associative terms”. They are:
- Include all the topics known to be of interest to the users of the information service that are treated substantively in the document
- Index each of these as specifically as the vocabulary of the system allows and the needs or interests of the users warrants (Lancaster, 1998: 31)
“Associative terms” should be distinguished from “related terms”. “Related terms” are terms within the same hierarchy. Generally, in indexing practice, using them together is “redundancy” or “multiple indexing” and is not recommended. In practice, however, some “multiple indexing” may be required for indexing variables. In evaluation we will look at the context before assigning this category. “Associative terms”, on the other hand, are terms that are not related in the hierarchical structures of a thesaurus (Lancaster, 1998: 15; Broughton, 2006: 129). They are terms used in combination to cover a concept and may be just as necessary when indexing a variable or question as they are in larger textual contexts. A common example in the ESDS Data Catalogue Records Corpus is the combination of the terms CHILDREN + HEALTH to cover the concept of CHILD HEALTH.
The use of “associative terms” is generally not easy to see in a long list of keywords but is very apparent when the text is as short as it is in the variable files of the Nesstar Corpus. Their use will depend both on how complex the concept may be and also the availability of suitable terms within the thesaurus. There will be a lead term followed by “qualifiers”, each of which is dependent on the previous term (Lancaster, 1998: 56). These “qualifiers” are “context dependent”.
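The distinction between “related” and “associative” terms can be made concrete with a small Python sketch. It assumes a toy broader-term table in which the top terms TOP TERM A and TOP TERM B and the term NARROWER HEALTH TERM are placeholders invented for the example; the function names are hypothetical and the table does not reproduce the real HASSET hierarchies.

```python
# Toy broader-term links with placeholder top terms, invented for
# illustration only. They do NOT reproduce the real HASSET hierarchies.
BROADER = {
    "CHILDREN": "TOP TERM A",
    "HEALTH": "TOP TERM B",
    "NARROWER HEALTH TERM": "HEALTH",
}


def top_term(term, broader=BROADER):
    """Follow broader-term links until the top of the hierarchy is reached."""
    while broader.get(term) is not None:
        term = broader[term]
    return term


def relationship(term_a, term_b, broader=BROADER):
    """Classify a pair of terms as 'related' (same hierarchy) or 'associative'."""
    same_hierarchy = top_term(term_a, broader) == top_term(term_b, broader)
    return "related" if same_hierarchy else "associative"


# CHILDREN + HEALTH come from different hierarchies and combine to
# cover a concept, so the pair is associative.
print(relationship("CHILDREN", "HEALTH"))
# Two terms from one hierarchy would count as multiple indexing.
print(relationship("HEALTH", "NARROWER HEALTH TERM"))
```

A check of this kind only flags candidate pairs; as noted above, whether a “related” pair is genuinely redundant still depends on the context in which the terms were assigned.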
An example in the Nesstar Bank of Variables/Questions Corpus is a variable from study number 5294 Workplace Employee Relations Survey, 2004: Cross-Section Survey, 2004 And Panel Survey, 1998-2004; Wave 2. The question is:
“For each of the above groups of employees, how many are in each of the following occupational groups? Protective and personal services: Full-time females”.
In the questionnaire ‘personal services’ are defined as ‘caring, leisure and other personal service occupations’. The HASSET term SERVICE INDUSTRIES covers ‘leisure and other personal services as well as protective services’. It does not cover ‘caring’. The HASSET term CARE comes with an instruction to choose a more specific term, while CAREGIVERS refers to “non-professionals”, therefore the allocated keywords were kept at the level of specificity illustrated in Figure 3.
Fig 3. Associative HASSET terms to describe a variable in the Workplace Employee Relations Survey, 2004: Cross-Section Survey, 2004 And Panel Survey, 1998-2004; Wave 2 dataset.
In this ‘context-dependent’ combination of ‘associative terms’, though SERVICE INDUSTRIES refers to OCCUPATIONS, the variable is one in a series of questions about ‘occupations’, so it is important to include this term to allow for analysis just at the level of OCCUPATIONS. In the catalogue record the list of keywords includes OCCUPATIONS, WOMEN and FULL-TIME EMPLOYMENT. It does not include SERVICE INDUSTRIES. Thus indexing for the Nesstar Bank of Variables/Questions Corpus requires a greater level of specificity in the selection of keywords than was required for indexing the ESDS Catalogue Record Corpus.
In this example the question is one of a series on “occupations”. Other connected questions will not necessarily follow in the same sequence. The problem is that of how a variable may be operationalized, as discussed above. Indexing completely at variable level, which the Nesstar Corpus required, will not necessarily group the variables in a convenient manner for the indexer to translate the concept’s operationalized definition back into a general language definition. Principles of question design make it mandatory that all respondents attribute the same meaning to the question/variable, but that meaning does not have to be the measured concept which indexing aims to capture. For the concept ETHNICITY, discussed above, RELIGION is an indicator. Religion questions are harmonized to ensure that the meaning will not be in any way ambiguous and the measurement is standard. However, the respondent may not know that data is being collected on ETHNICITY. If the measurement is a subjective one, that is if the respondent is asked what ethnic group they belong to, then a straightforward HASSET term ETHNIC GROUPS will apply (see the ONS guide Ethnic Group Statistics: A Guide for the Collection and Classification of Ethnicity Data).
However it should be kept in mind that these associations do rely on the availability of suitable terms within the thesaurus that follow the rules of word combination. There are word combinations that should not be split in order to preserve meaning (de Keyser, 2012: 20).
The level of specificity of gold standard indexing for our four corpora reflected the type of information object a user may wish to retrieve. The Case Studies and Support Guides Corpus is a relatively small collection of documents and requires a few broad terms to ensure retrieval. Other corpora require the application of a greater degree of specificity in the keywords chosen as well as the use of associative terms, which may or may not be visible depending on the size of the text file.
Nesstar indexing was undertaken to locate variables, which in the majority of cases match questions or parts of questions and sometimes sections of the questionnaire. They are complex in design and difficult to process individually. SQB indexing on the other hand was undertaken to locate questions. Both corpora require a considerable level of specificity in the selection of keywords to cover concepts of some complexity that will at times exhaust the availability of terms within a thesaurus hierarchy and perhaps require it to be extended.
We expect to release the evaluation results in the next few weeks.
Broughton, V. (2004) Controlled Indexing Languages. In Essential Classification. London, Facet.
Broughton, V. (2004) Faceted Classification. In Essential Classification. London, Facet.
Broughton, V. (2006) Essential Thesaurus Construction. London, Facet.
Bulmer, M. ed. (2010) Social Measurement through Social Surveys: an applied approach. Farnham, Ashgate.
De Keyser, P. (2012) Indexing: From Thesauri to the Semantic Web. Oxford, Chandos.
Hyman, L., Lamb, J. and Bulmer, M. (2006) The Use of Pre-Existing Survey Questions: Implications for Data Quality. Proceedings of European Conference of Quality in Survey Statistics, Rome. Retrieved on 05/01/2013.
Kneeshaw, J. (2011) The UK’s Survey Question Bank: Present and Future Developments. Paris, Question Database Workshop, Reseau Quetelet.
Lancaster, F. W. (1998) 2nd ed. Quality of Indexing. In Indexing and Abstracting in Theory and Practice. Champaign, Illinois, University of Illinois.
Office for National Statistics (2003) Ethnic Group Statistics: A Guide for the Collection and Classification of Ethnicity Data. Retrieved from http://www.ons.gov.uk on 05/01/2013.