Searching data collections has become a common and important process to computer users. Online search engines nowadays provide satisfying results for users’ queries but this becomes a challenge when a domain specific search is to be performed. Selecting accurate search terms is the main problem for domain specific automatic search. One solution is manual indexing by professional indexers or sometimes by authors, whereby these people select those keywords that will provide accurate and quick content-based access to the data collection. With the increasing amount of information available on the internet and the rapid growth of domain specific intranets, manual indexing is getting slower and more expensive. Automatic indexing on the other hand is faster and cheaper and can deal with the vast amounts of information. Automatic indexing processes huge amounts of information in a feasible time and with least effort. Our project is examining both the efficiency and the accuracy of keywords automation. In our work we are testing the capacity and quality of automatic indexing using a controlled vocabulary (thesaurus) called HASSET (Humanities and Social Science Electronic Thesaurus). HASSET is taken as the vocabulary source for the automatic indexing task. Automatic indexing is being applied to the UK Data Archive’s collection. The automatic indexing will provide a ranked list of candidate keywords to the human expert for final decision-making. The accuracy or effectiveness of the automatic indexing will be measured by the degree of overlap between the automated indexing decisions and those originally made by the human indexer (gold standards). This blog describes the work we have undertaken so far in the automatic indexing task. Further blog posts will be published once our results are ready.
2. Problem and Motivation
For many years the indexing process of the Archive’s collection has been done manually by professional human indexers. The process begins with the indexer going through study documentation, usually the questionnaire or interview schedule, though accompanying quantitative data files are also checked for derived variables not covered in the questionnaire. Keywords that represent the topics covered by the study are chosen and their best match is selected from the HASSET thesaurus Attention is paid to terms used over time within data series and across similar studies to ensure consistency of subject/keyword coverage within the collection. The processing time varies considerably depending on the size and complexity of the study. The keywords selected from HASSET stand as the high quality (gold) standard that can be used as training data for automatic indexers and as a comparator. As a result of the indexing process, new keywords may be recommended for addition to HASSET if needed.
3. Manual and Automatic Indexing
Indexing has been used for a long time as an efficient means of accessing information (i.e. indexes in books, telephone directories, libraries, etc.). Indexing can be done either manually or automatically. Manual indexing is a time consuming process. Automatic indexing, therefore, has been widely used in experimental systems such as information retrieval and information extraction [Croft et al. 2009]. Automatic indexing, in recent years, has shown to be labour-saving, more efficient and consistent than manual indexing [White et al. 2012, Hliaoutakis et al. 2006].
As shown in Figure 1, the automatic indexing process starts with automatically selecting documents from the collection set (documents that are not yet indexed). During this process information about the document is sent to the index (i.e. database), including the document’s title, size, location and genre if present. The selected documents are split into sentences using delimiters (e.g. full-stop, question-mark, and exclamation-mark). Finally those sentences are split into tokens (i.e. words) based on delimiters (e.g. white space). The tokens are indexed and information about each token’s location, position, frequency and weight are recorded. In our work the indexing process works with a controlled vocabulary. Controlled vocabulary mandate the use of predefined, authorised terms that have been preselected by the designer of the vocabulary (i.e. HASSET), in contrast to natural language vocabularies, where there is no restriction on the vocabulary.
Figure 1 Index Creation Process
4. Data Collection
The data collection used in our work is that from the UK Data Archive. This project will not look at the quantitative data collection themselves but rather at the contextual textual documents that accompany them. The collection provides documents and their gold standard keywords to be used for training and testing the automatic indexer. The material includes:
- A bank of variables/questions (individual variables from Nesstar indexed, each with HASSET terms specific to themselves).
- Survey Question Bank (SQB) questionnaires.
- ESDS data catalogue records.
- Abstracts (from all catalogue records).
- Full catalogue records (dating from 2005).
- Other full-text documents.
5. Data Collection Pre-processing
Automatic indexing requires a number of steps before the process can start. These steps are document formatting and metadata extraction. The first step converts the data collection from portable document format (PDF) to unstructured text. This is done by running a PowerShell wrapper script of the Linux open source “Xpdf” software. The script uses “pdftotext” converter to crawl through the data collection and converts the content of PDF files into text format “.txt”.
The second phase extracts metadata from the PDF data collection. The process includes using “Xpdf” software through applying “pdfinfo” script. We have used pdfinfo as part of a PowerShell script called “extractTAGS.ps1” to extract the metadata tags and store them in txt and xml file formats. The script extracts metadata tags such as the document’s title, keywords, producer, modification and creation date, etc. In our work we are mainly interested in the keywords attached to the data collection as those represent the gold standard. The extracted keywords are saved into a “.key” file format preserving the original file name as with the “.txt” files. The script, extractTAGS.ps1 was written as part of this work. Figure 2 shows the automatic indexing overall process. Figure 3 shows the system deployment process (all of the activities that make a software system available for use) of applying automatic indexing where no training and evaluation are required. Both the overall and the deployment processes start with getting the pdf files and converting them into text. Extracting the manual keywords is needed for the automatic and manual evaluation steps as in Figure 2. Automatic indexing refers to the selection of keywords from a document by a computer to be used as index entries (see Section 3).
Figure 2 Automatic Indexing Overall Process
Figure 3 Automatic System Deployment Process
6. Experimental Work
In our work so far we have investigated two text mining techniques to automatically index the data collection. The techniques used are the TF.IDF model and Keyphrase Extraction Algorithm (KEA) [Medelyan and Witten, 2006].
The work began by applying SKOS (Simple Knowledge Organization System) to HASSET.
We have taken SKOS-HASSET and applied it to a selected, representative part of the ESDS collection (see Section 4) to test its automatic indexing capabilities.
6.1 TF.IDF Model
One of the most common models to compute term weight is the tf.idf weighting, which makes use of two factors: the term frequency tf and the inverse document frequency idf [Salton and McGill, 1986]. The weight w of term j in a document i can be computed as:
and tfij , the term frequency, is the number of times that term j occurs in document i, n is the number of documents in the collection, and dfj is the document frequency of term j [Grossman and Frieder, 2004].
The information in the index will be used in the ranking process where information about each token’s (or a subsequence of words) weight will be used to calculate the similarity. In general, weights reflect the relative importance (based on factors such as the term-frequency of this token) of the extracted tokens in documents.
We used the tf.idf model to extract keywords from the SQB questionnaires (see Section 4). We processed 2436 SQB documents. We did not use any training data nor did we use a controlled vocabulary as the tf.idf model is a basic model that does not require training. The reason behind using tf.idf is the fact that no training data was needed. We asked the system to extract keywords limiting the number of subsequence words in each keyword to three (e.g. “British Conservative Party”). Although HASSET terms can contain up to five words (e.g. “LIBRARY, INFORMATION AND ARCHIVAL SCIENCES EDUCATION”) we were not able to process more than three subsequent words as this could result in a time infeasible process.
Indexing a domain specific data collection (i.e. Economic and Social Science) using tf.idf model resulted in extracting keywords that carry low domain-specific information content. When mapping the extracted keywords to the HASSET terms we had a very few matches. For example, the tf.idf system failed in finding matches to the keyword “Liberal Party” although it exists in HASSET but in a different form “BRITISH LIBERAL PARTY”, “LIBERAL PARTY (GREAT BRITAIN)”. To overcome this problem we used domain-specific keywords indexing with controlled vocabulary using HASSET thesaurus. Another way round this would be to match sub-parts of HASSET terms, or ignoring qualifier of HASSET terms (i.e. the part in brackets). This way the process extracts keywords that only exist in the HASSET thesaurus.
6.2 Keywords Indexing using Controlled Vocabulary
Indexing documents requires selecting the keywords from the controlled vocabulary that best describes a document. Automatic indexing requires training data consisting of documents and their associated keywords. The training data are used to build a classifier model for each keyword. The model is then applied to new (previously unseen) documents to assign keywords. The training data set is chosen based on the percentage of keywords’ coverage when compared with HASSET terms.
In our work we used the Keywords Extraction Algorithm (KEA). Figure 4 illustrates the keywords extraction process, which contains two stages (1) training and (2) keyword extraction (testing). Figure 5 shows the system deployment process where no training and testing are needed. The models created during the training phase are used in the extraction phase. The testing data are now the actual data that need to be automatically indexed to extract keywords. The algorithm is based on machine learning and works in two main steps: candidate term identification, which identifies thesaurus terms that relate to the document’s content; and filtering, which uses a learned model (using our training data) to identify the most significant keywords based on certain properties or “features” which includes the tf.idf measure, the first occurrences of a keyword in a document, the length of keywords (e.g. two words) and node degree (the number of keywords that are semantically related to a candidate keyword) [Witten et al., 2005]. The machine learning scheme first builds a model using training documents with known keywords, and then uses the model to find keywords in new documents. KEA uses the latest version of the Weka machine learning workbench, which contains a collection of visualisation tools and algorithms for data analysis and predictive modelling [Witten and Frank, 2000]. Weka supports several standard data mining tasks, more specifically, data pre-processing, clustering, classification, regression, visualization, and feature selection. Both KEA and Weka are open source software under the GNU General Public Licence. In our work we rely on Weka’s feature selection functionality which is used by KEA to measure the importance of a certain keyword.
Figure 4 Keywords Extraction Process using KEA
Figure 5 Keywords Extraction System Deployment using KEA
6.2.1 KEA Algorithm
The keyword extraction process has two stages [Witten et al., 2005]:
1- Training: create a model for identifying keywords, using training documents where the human indexer’s keywords are known. We use the UK Data Archive’s questionnaires (SQBs). We perform a number of runs where we try different training sets (based on the percentage of keywords coverage) in each run. The total number of the SQB documents is 2436. We have selected 80% (1948 documents) of the data collection for training. And another 20% (488 documents) for testing. The controlled vocabulary used is SKOS-HASSET (see Section 6). A common English stop word list has been augmented with domain specific terms.
2- Keyword Extraction: choose keywords from a new (test) document.
Figure 6 shows the training and extraction stages. Both stages choose a set of candidate keywords from the input documents and then calculate the values of the features mentioned earlier. The example shown in the figure is taken from agricultural documents using the Agrovoc thesaurus.
Figure 6 KEA Training and Extraction Processes
KEA chooses candidate keywords in three steps [Witten et al., 2005]:
1- Input Cleaning
· Replace punctuation marks, brackets and number with phrase boundaries.
· Remove apostrophes.
· Split hyphenated words into two.
· Remove non-token characters (that do not contain letters).
2- Candidate Identification
· Candidate phrases are limited to a certain maximum length (usually three words); unlike the tf.idf experiment (see Section 6.1) using KEA we set a maximum length of five words.
· Candidate phrases cannot be proper names (i.e. single words that only ever appear with an initial capital).
· Candidate phrases cannot begin or end with a stop-word.
3- Phrase stemming and case-folding (the distinction between upper and lower-case)
Stemming determines the morphological stem of a given inflected word. It uses morphological rules or heuristics to remove affixes from words before indexing. Stemming reduces the number of indexing terms and reduces redundancy (by collapsing together words with related meanings). Stemmers are common elements in information retrieval systems, since a user who runs a query on ”computers” might be also interested in documents that contain the word “computer” or “computerization” [Croft et al., 2009]. Stemming is only applied to the document’s terms, where terms with related meanings are grouped together.
KEA uses Weka to calculate four features for each candidate phrase. Which are used in the training and extraction process. The features are “TF.IDF” and “First Occurrence”, “Length” and “Node degree”.
· TF.IDF compares the frequency of a phrase’s use in a particular document with the frequency of the phrase in general use (see TFIDF Method).
· First Occurrence is calculated as the number of words that precede the phrase’s first appearance, divided by the number of words in the document. The result is a number between 0 and 1 that represents how much of the document precedes the phrase’s first appearance.
· Length of a keyword is the number of its component words.
· Node degree of a candidate keyword is the number of keywords in the candidate set that is semantically related to this keyword. This is computed with the help of the thesaurus. Phrases with high degree are more likely to be keyphrases.
Training: building the model
The training stage uses a set of training documents for which the author’s keywords are known. For each training document, candidate phrases are identified and their feature values are calculated as described above. Each phrase is then marked as a keyword or a non-keyword, using the actual keywords that have been manually assigned to the document. This binary feature is the class feature used by the machine learning scheme.
Extraction of new keywords
To select keywords from a new document, KEA determines candidate phrases and feature values, and then applies the model built during training. The model determines the overall probability that each candidate is a keyword, and then a post-processing operation selects the best set of keywords using a Naïve Bayes model (a simple probabilistic model based on applying Bayes’ theorem (from Bayesian statistics) with strong (naive) independence assumptions) [Chen et al. 2009].
Keyword quality is measured by counting the number of matches between KEA’s output and the keywords that were originally chosen by the professional human indexer. We also use recall and precision to assess the effectiveness of the automatic keyword extraction process.
Precision: the fraction of retrieved instances that is relevant. Precision can be interpreted as the number of keywords judged as correct divided by the total number of keywords assigned. Recall: the fraction of relevant instances that are retrieved. Recall can be interpreted as the number of correct keywords assigned, divided by all keywords deemed relevant to the object. F1 score (or F-measure, F-score): F1 score considers both recall and precision to measure accuracy. It can be interpreted as a weighted average of the precision and recall (best value 1, worst 0).
The system was trained using 1948 SQB documents. Table 1 shows the preliminary recall, precision and F1 scores of applying KEA algorithm on 488 test SQB documents. The recall and precision were measured by comparing the extracted keywords with the keywords that have been manually assigned to the document.
As shown in the table, increasing the number of automatically extracted keywords increases the recall but, at the same time, affects the precision.
Table 1 Keyword Extraction Scores for SQB Documents
|Number of Keywords||Recall||Precision||F-Measure|
Additional evaluation methods, as described in our evaluation plan, will be applied as the next step in our work. We are in the process of cleaning the data, which will, hopefully, help in producing more accurate results. We will also compare our results with those achieved by third parties, where possible and appropriate.
In this work we test SKOS-HASSET’s automated indexing capacity. SKOS-HASSET was taken as the terminology source for an automatic indexing tool and applied to question text, abstracts and publications from the Archive’s collection. The results were compared to the gold standard of manual indexing. Limitations of this approach include the small-sized training data. The use of HASSET thesaurus vocabulary limits the approach to data that falls in the Humanities and Social Science domains, but this can be solved by including vocabularies from other domains.
Future blog posts will be published as this work package. This customer blog post will also form the basis for our automated indexing exemplar / case study.
A. Hliaoutakis, K. Zervanou, E. G.M. Petrakis, and E. E. Milios. 2006. Automatic document indexing in large medical collections. In Proceedings of the international workshop on Healthcare information and knowledge management (HIKM ’06). ACM, New York, NY, USA, 1-8. DOI=10.1145/1183568.1183570 http://doi.acm.org/10.1145/1183568.1183570
B. Croft, D. Metzler, and T. Strohman. Search Engines – Information Retrieval in
Practice. Pearson Education, 2009. ISBN 978-0-13-136489-9.
D. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. The Kluwer International Series of Information Retrieval. Springer, second edition, 2004.
G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1986. ISBN 0070544840.
H. White, C. Willis, and J. Greenberg. 2012. The HIVE impact: contributing to consistency via automatic indexing. In Proceedings of the 2012 iConference (iConference ’12). ACM, New York, NY, USA, 582-584. DOI=10.1145/2132176.2132297 http://doi.acm.org/10.1145/2132176.2132297
I.H. Witten and E. Frank. Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco.
I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, and C.G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Y.-L. Theng and S. Foo, editors, Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129–152. Information Science Publishing, London, 2005.
J. Chen, H. Huang, S. Tian, Y. Qu, Feature selection for text classification with Naïve Bayes, Expert Systems with Applications, Volume 36, Issue 3, Part 1, April 2009.
O. Medelyan , I.H. Witten, Thesaurus based automatic keyphrase indexing, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, June 11-15, 2006, Chapel Hill, NC, USA [doi>10.1145/1141753.1141819]