We’d like to share, in a bit more detail, the contents of the #SKOS-HASSET work packages and how these are going to be achieved in the time available. As ever, any comments or questions will be welcomed.
Timeline and work packages
Our project started in June 2012 and is due to finish in January 2013. In that time, we have a lot to do, but we are still on track. The creation of the ‘question/variable bank’ has taken very slightly longer than anticipated, due to technical complications in retrieving the question data from Nesstar. This particular task involves indexing 30,000 questions. As this indexing represents the manual indexing (gold) standard, to be used for comparison purposes, we won’t need the indexed questions until 1 October when the comparison work begins. We are on track to have achieved this indexing by then.
Our Gantt chart appears below:
Our work packages, in more detail, are:
WORKPACKAGE 1: Thesaurus preparation: SKOS application and hierarchy merger, 6 June 2012 – 23 July 2013
Objective: This work package will apply SKOS to HASSET and align HASSET with ELSST at database level in preparation for other work package tasks.
2. The two existing sets of hierarchies (for HASSET and ELSST) will be successfully merged. This will involve the machine-driven comparison of the two hierarchies, plus the manual merging of terms and hierarchies. The new ISO 25964 will be used in relation to this work (should it have been released by then). SKOS will be re-applied to the combined hierarchies.
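To give a flavour of what a machine-driven comparison of two hierarchies might involve, here is a minimal Python sketch. It is an illustration only, not the project's actual tooling: it assumes each thesaurus can be reduced to a simple mapping from a term to its broader term, and the function name `compare_hierarchies` and the sample terms are invented for the example.

```python
def compare_hierarchies(a, b):
    """Compare two {term: broader_term} mappings (None marks a top term).

    Returns the terms unique to each side, the shared terms that sit under
    the same broader term, and the shared terms whose broader terms differ.
    The last group is the one that would need manual attention.
    """
    shared = set(a) & set(b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "same_parent": sorted(t for t in shared if a[t] == b[t]),
        "conflicts": sorted(t for t in shared if a[t] != b[t]),
    }


# Invented sample data, not real HASSET or ELSST content.
hasset = {
    "EMPLOYMENT": None,
    "UNEMPLOYMENT": "EMPLOYMENT",
    "YOUTH UNEMPLOYMENT": "UNEMPLOYMENT",
    "JOB SEEKING": "EMPLOYMENT",
}
elsst = {
    "EMPLOYMENT": None,
    "UNEMPLOYMENT": "EMPLOYMENT",
    "JOB SEEKING": "UNEMPLOYMENT",
    "LABOUR MARKET": None,
}

report = compare_hierarchies(hasset, elsst)
for group, terms in report.items():
    print(group, terms)
```

In this sketch the automated step only sorts terms into groups; deciding what to do with the conflicts is left to a person, in line with the manual merging described above.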
Deliverables: WP1.D2: SKOS-HASSET will be created. WP1.D1: Unified database: a single thesaurus merged from the existing ELSST and HASSET hierarchies using, if available, ISO 25964-2.
Milestones: 1st version of SKOS-HASSET, validated using industry tools. Merger of underlying hierarchies, ready for comparison.
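For readers unfamiliar with SKOS, the sketch below shows roughly what one thesaurus term looks like once expressed as a SKOS concept in Turtle syntax. It is a simplified illustration under stated assumptions: the base URI `http://example.org/thesaurus/`, the helper names and the sample terms are all hypothetical, and only `skos:prefLabel` and `skos:broader` are shown.

```python
# The SKOS namespace is the standard W3C one; BASE is a made-up placeholder.
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
BASE = "http://example.org/thesaurus/"


def slug(term):
    """Turn a term such as 'JOB SEEKING' into a URI-friendly 'job-seeking'."""
    return term.lower().replace(" ", "-")


def concept_to_turtle(term, broader=None, lang="en"):
    """Serialise one term as a skos:Concept in Turtle.

    Assumes the term contains no double quotes or backslashes; a real
    exporter would need to escape them.
    """
    lines = [
        f"<{BASE}{slug(term)}> a skos:Concept ;",
        f'    skos:prefLabel "{term}"@{lang}',
    ]
    if broader is not None:
        lines[-1] += " ;"
        lines.append(f"    skos:broader <{BASE}{slug(broader)}>")
    lines[-1] += " ."
    return "\n".join(lines)


turtle = concept_to_turtle("UNEMPLOYMENT", broader="EMPLOYMENT")
print(SKOS_PREFIX)
print(turtle)
```

A production conversion would more likely use an RDF library and cover the full set of thesaurus relationships (related terms, alternative labels, scope notes).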
WORKPACKAGE 2: Automatic indexing exemplar, 2 July 2012 – 21 December 2012
Objective: This work package will take SKOS-HASSET and apply it to a selected, representative part of the ESDS collection to test its automatic indexing capabilities.
- Apply automatic indexing to a set of textual sources and compare the results with the gold standard of manual indexing. This task includes three sub-tasks:
a) Assemble a representative set of test data, with terms assigned. Four kinds of data sources, which have existing terms assigned, will be included:
i) A ‘question bank’ of well-known robust questions used to measure key socio-economic concepts will be set up, formally linked to assigned concepts/terms. This ‘question bank’ will relate to the representative set of data and builds on existing work at the Archive to link question text to surveys in Nesstar (Nesstar DDI metadata will be used).
ii) Documentation/questionnaires for large-scale social surveys, such as the government and major longitudinal surveys, will be converted to text.
iii) Structured catalogue metadata for all 5000 data collections, including abstracts and other fields, will be converted to text.
iv) Other bibliographic items will also be converted to text (the Archive creates its own online support guides (around 200) and writes up case studies (over 100)).
b) Apply automatic indexing to these items, using SKOS-HASSET as the source vocabulary, and employing at least two existing NLP technologies (such as TF/IDF and LSAI/A) and at least two tools (such as WEKA, GATE, OpenNLP, HIVE – which dynamically integrates discipline-specific controlled vocabularies encoded with SKOS – Apache Nutch – which would fit with the Archive’s existing suite of Lucene- and Solr-based applications – or NaCTeM tools).
c) Compare the results with the Archive’s existing, manual indexing process. As a second level of evaluation, the Archive’s results will also be compared with those experimental results acquired by third parties. The evaluative part of this work will: specify the workflows involved in the processes of term extraction, indexing and subsequent checking and editing; investigate better ways of integrating term recognition technology with automated indexing systems; test and evaluate the efficacy of the tools amongst internal Archive indexers and with appropriate stakeholders.
- The project team will finally make recommendations on approaches to reliable and robust automated term assignment procedures. These will inform the Archive’s data processing workflows.
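As a rough illustration of the TF/IDF idea mentioned in task (b), the Python sketch below scores controlled-vocabulary terms against a few short texts and assigns the highest-scoring ones to each text. It is a deliberately simplified sketch, not the project's implementation: it matches single-word terms only (real thesaurus terms are often multi-word), the function names, sample questions and vocabulary are invented, and the tools named above would each handle this differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_index(docs, vocabulary, top_n=3):
    """Assign up to top_n vocabulary terms to each document by TF-IDF score.

    tf  = occurrences of the term / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    vocab = {term.lower() for term in vocabulary}
    token_lists = [tokenize(doc) for doc in docs]
    counts = [Counter(t for t in tokens if t in vocab) for tokens in token_lists]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)

    assigned = []
    for tokens, c in zip(token_lists, counts):
        scores = {
            term: (n / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        assigned.append(ranked[:top_n])
    return assigned


# Invented sample questions and vocabulary, for illustration only.
questions = [
    "How long have you been unemployed? Is unemployment benefit your main income?",
    "What is your total household income before tax? Include income from employment.",
    "Describe your housing tenure and housing costs.",
]
vocabulary = ["unemployment", "income", "employment", "housing"]

index_terms = tfidf_index(questions, vocabulary)
for question, terms in zip(questions, index_terms):
    print(terms, "<-", question)
```

Note that "unemployed" in the first question is not matched to "unemployment": exact token matching misses word variants, which is one reason dedicated term-recognition tools are worth evaluating.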
Deliverables: WP2.D1: Exemplar, demonstrating the automatic indexing functionality of SKOS-HASSET using a selection of text mining tools and comparing results to the gold standard of manual indexing. WP2.D2: Recommendations report on approaches to automated indexing for social science data resources.
Milestones: Collation of set of test data for text mining. Results set of textual documents with SKOS-HASSET terms applied, ready to be compared with the gold standard.
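One common way to compare automatically assigned terms with a manually indexed gold standard is to compute precision, recall and F1 per document. The sketch below assumes that approach; the source does not specify which measures the project will use, and the function name and sample terms are hypothetical.

```python
def evaluate(auto_terms, gold_terms):
    """Return (precision, recall, f1) for one document.

    precision = share of automatic terms that the manual indexer also chose
    recall    = share of manual terms that the automatic indexer found
    """
    auto, gold = set(auto_terms), set(gold_terms)
    matched = len(auto & gold)
    precision = matched / len(auto) if auto else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: three automatic terms, four manually assigned terms.
auto = ["UNEMPLOYMENT", "INCOME", "HOUSING"]
gold = ["UNEMPLOYMENT", "INCOME", "BENEFITS", "EMPLOYMENT"]

precision, recall, f1 = evaluate(auto, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact set matching is strict: it gives no credit for an automatic term that is a broader or narrower term of the one the indexer chose, so a hierarchy-aware measure may be worth considering alongside it.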
WORKPACKAGE 3: Online user pages and interfaces, 12 November 2012 – 11 January 2013
Objective: This work package will create or, where appropriate, update, user-facing web pages and the management interface to the thesaurus products to facilitate ease of access for the social science community and to streamline updates to the products.
- Mount SKOS-HASSET online along with the browseable tree structure.
- Update the user-facing web pages with new text, guidance and tree structure.
- Streamline the online thesaurus management interface.
Deliverables: WP3.D1: Release of SKOS-HASSET online. WP3.D2: Refreshed online user-facing webpages, plus user guidance and the tree structure. WP3.D3: Updated online thesaurus management interface, for use with the combined hierarchies, permitting the release of new versions of the thesaurus, including new versions of the SKOS product.
Milestones: Release of SKOS-HASSET online. Creation of user guidance. Management interface is usable with HASSET and ELSST.
WORKPACKAGE 4: Licensing report, 6 June 2012 – 28 September 2012
Objective: This small but important work package will examine the licensing options available for SKOS-HASSET, comparing various open licences and building on the work of the JISC Digital Infrastructure Team and the UK Data Archive, and referring to the work of the Naomi Korn Copyright Consultancy.
1. Compare licences which could be used in relation to SKOS-HASSET, including the JISC Model Licence, the JISC Open Educational User Licence v 1.0, Open Data Commons, Open Government Data Licence and Creative Commons. The Creative Commons compatibility wizards will also be used. Make recommendations for which licence model to employ.
Deliverable: WP4.D1: Licence recommendation report.
WORKPACKAGE 5: User communication and engagement, 4 June 2012 – 31 January 2013
Objective: To communicate with, engage and seek input from the stakeholder and user communities, including funders, existing HASSET/ELSST users, potential users, text miners and thesaurus developers/information scientists.
- A webinar for up to 100 people at a time will be staged to showcase and answer questions on the thesaurus product and its automated indexing capabilities. GoToWebinar will be used (for which the Archive already has a licence). Accompanying user guidance will be written and made available. The webinar will be evaluated and the results reported.
- A regularly updated project website and online blog plus newsletter will be created to share all developments and work with the wider community. All lessons learned will be evaluated as the project progresses and communicated via the blog/newsletter.
- A printed and downloadable A5 leaflet will be created, publicising the advances made by the project, and encouraging uptake in the community.
- A virtual user forum will be set up, based on the list of existing HASSET users, with regular JISCmail communications / webinars planned for the future. This user forum will be expanded through active promotion of the project work. An email list will be set up and the user forum will be consulted about developments in a timely manner. Part of the sustainability work of this project will be to gain wider community involvement in the development of HASSET.
- A report, reviewing and assessing all the tasks and techniques applied and collating all the blog posts and lessons learned, will be written and submitted to the Programme Manager.
Deliverables: WP5.D1: SKOS-HASSET and automated indexing webinar prepared, held and evaluated. WP5.D2: Project website, blog and newsletter all set up and populated on a regular basis. WP5.D3: An A5 leaflet will be created and circulated at relevant forums, conferences and seminars. WP5.D4: User forum established, for novice and expert users for knowledge sharing, during the life of the project and after. This will comprise the creation of a list of stakeholders’ email addresses (complying with Data Protection requirements), a JISCmail list and the investigation of the creation of a Wiki, plus regular communications to, and the invitation of responses from, the group. WP5.D5: The final project report will synthesise all the work and analyse success.
Milestones: Website populated. 1st blog released. JISCmail list set up. List of stakeholders contacted. Final draft report submitted.