A hybrid approach for automating citation screening process

2013 Québec City

Zhang D¹, Lei J¹, Robinson KA¹

¹Johns Hopkins School of Medicine, USA

Background: Building a classification method to facilitate screening of search results is a direct way to make the systematic review process more efficient. We propose a hybrid approach that optimizes the collection of features used to characterize the citations (the feature space) by combining Independent Component Analysis (ICA) with Sequential Forward Floating Search (SFFS), and that uses Support Vector Machine (SVM), Voted Perceptron (VP), and BayesNet (BN) classifiers both for comparison and to achieve the best possible outcome.

Objectives: To optimize the feature space using ICA and SFFS, and to compare and tune the results obtained with SVM, VP and BN.

Methods: We used the search results and lists of eligible studies from three systematic reviews. First, we built three feature spaces: (i) MeSH terms, (ii) title keywords, and (iii) keywords from abstracts. We used ICA to extract 500 ‘relevant’ feature types from the 5000+ types across these three spaces for the three projects. We then used a modified SFFS method to select the ‘most relevant’ feature types for each individual project, through training and testing with SVM, BN, and VP. Finally, we used SVM, VP and BN to classify the citation collections. For comparison, we also ran the process without ICA, and without both ICA and SFFS.
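The floating search step can be illustrated with a minimal Python sketch. It follows the textbook form of SFFS, not the modified variant used in this work, whose details are not given here. The names `sffs`, `toy_score`, `WEIGHTS` and `REDUNDANT` are hypothetical, and the toy score is a stand-in for what would, in this setting, be classifier performance (for example from SVM, BN or VP) measured on held-out citations.

```python
def sffs(candidates, score, target_size):
    """Textbook-style Sequential Forward Floating Search.

    candidates:  list of feature identifiers
    score:       function mapping a frozenset of features to a number
                 (higher is better)
    target_size: number of features to select
    Returns (selected_subset, score_of_subset).
    """
    selected = frozenset()
    best_by_size = {}  # subset size -> (best score seen, subset)

    while len(selected) < target_size:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break

        # Forward step: add the single feature that gives the best score.
        selected = max((selected | {f} for f in remaining), key=score)
        size, current = len(selected), score(selected)
        if size not in best_by_size or current > best_by_size[size][0]:
            best_by_size[size] = (current, selected)
        else:
            # A better subset of this size was found earlier; resume from it.
            selected = best_by_size[size][1]

        # Floating step: drop features for as long as doing so beats the
        # best subset previously recorded at the smaller size.
        while len(selected) > 2:
            reduced = max((selected - {f} for f in sorted(selected)), key=score)
            reduced_score = score(reduced)
            if reduced_score > best_by_size[len(reduced)][0]:
                best_by_size[len(reduced)] = (reduced_score, reduced)
                selected = reduced
            else:
                break

    return selected, score(selected)


# Toy scoring function (an assumption for illustration only): each feature
# has an individual value, and some pairs are redundant with each other.
WEIGHTS = {0: 5, 1: 4, 2: 4, 3: 1}
REDUNDANT = [frozenset({0, 1}), frozenset({0, 2})]


def toy_score(subset):
    value = sum(WEIGHTS[f] for f in subset)
    penalty = sum(3 for pair in REDUNDANT if pair <= subset)
    return value - penalty


chosen, chosen_score = sffs([0, 1, 2, 3], toy_score, target_size=3)
print(sorted(chosen), chosen_score)
```

In this toy case the floating step discards feature 0, which looks best on its own but is redundant with features 1 and 2; a purely forward search would have kept it.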

Results: The preliminary results show that sensitivity increased from 55.02%, 58.4%, and 53.21% to 90.23%, 84.67% and 88.02% for the three projects, respectively, after ICA/SFFS optimization. Specificity changed from 56.32%, 67.34%, and 84.34% to 77.98%, 80.04% and 83.74%. High sensitivity means that eligible (‘right’) documents are reliably included; high specificity means that ineligible (‘wrong’) documents are reliably excluded, so that screening time is not wasted.
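For reference, the two measures can be computed from screening decisions as in the short Python sketch below. The function name `sensitivity_specificity` and the toy labels are illustrative assumptions, not data from the three projects; 1 denotes an eligible citation and 0 an ineligible one.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = eligible)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # eligible, included
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # eligible, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ineligible, excluded
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ineligible, included
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 4 eligible and 6 ineligible citations.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2%}, specificity={spec:.2%}")
```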

Conclusions: The preliminary results show that optimizing the feature space is an important route to improving classification. With further improvement, it may be possible to reach 100% sensitivity while still maintaining high specificity.