Background: Machine learning could improve the efficiency of literature reviews by automating the screening of titles and abstracts for relevant articles.
Objectives: This study investigated whether the Fisher classification method can feasibly distinguish between relevant and irrelevant articles for literature reviews based on their titles and abstracts, and whether its success varies with the type of classification being performed.
Methods: Datasets were created from the abstract lists of three systematic reviews. To explore whether the algorithm performed better on particular types of classification (by study design, outcomes measured, interventions used or disease area), decisions at various points of the eligibility flowcharts were tested separately. Articles were labelled as ‘relevant’ or ‘irrelevant’ at each of these stages. The datasets were processed to remove duplicates and to adjust for imbalances in the ratio of ‘relevant’ to ‘irrelevant’ abstracts, a possible confounder. Articles with only a title or only an abstract were retained.
Each dataset was divided into training (60%), cross-validation (20%) and test (20%) sets. Performance was measured using classification accuracy and the F2 score, which favours correct classification of ‘relevant’ items. After training the classifier, we optimised the F2 score on the cross-validation set by adjusting the thresholds at which a ‘relevant’ or ‘irrelevant’ label was assigned. Items that did not meet either threshold were marked as ‘uncertain’ by the algorithm and excluded from the F2 score calculation. The final F2 score was calculated on the test set.
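The scoring and threshold-tuning step can be illustrated with a minimal Python sketch. It is not the study's implementation. It assumes the classifier outputs a probability-like score for ‘relevant’, that two thresholds (`low` and `high`, both hypothetical names) delimit an ‘uncertain’ band, and that thresholds are chosen by a simple grid search on the cross-validation set. The F2 score is the standard F-beta with beta = 2, that is 5TP / (5TP + 4FN + FP).

```python
from itertools import product


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from counts; beta=2 weights recall above precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def label(score, low, high):
    """Map a score to a label; scores strictly between the thresholds are 'uncertain'."""
    if score >= high:
        return "relevant"
    if score <= low:
        return "irrelevant"
    return "uncertain"


def f2_excluding_uncertain(scores, truths, low, high):
    """F2 over items that received a definite label (assumed reading of the method)."""
    tp = fp = fn = 0
    for score, is_relevant in zip(scores, truths):
        predicted = label(score, low, high)
        if predicted == "uncertain":
            continue  # excluded from the F2 calculation
        if predicted == "relevant":
            if is_relevant:
                tp += 1
            else:
                fp += 1
        elif is_relevant:
            fn += 1  # a relevant article labelled 'irrelevant'
    return f_beta(tp, fp, fn)


def tune_thresholds(scores, truths, grid):
    """Grid search for the (low, high) pair with the highest F2 on a validation set."""
    best = (None, None, -1.0)
    for low, high in product(grid, grid):
        if low > high:
            continue
        score = f2_excluding_uncertain(scores, truths, low, high)
        if score > best[2]:
            best = (low, high, score)
    return best


# Toy cross-validation data, invented purely for illustration.
cv_scores = [0.95, 0.80, 0.55, 0.45, 0.30, 0.10]
cv_truths = [True, True, False, True, False, False]
low, high, cv_f2 = tune_thresholds(cv_scores, cv_truths, [0.4, 0.5, 0.6])
```

In this sketch the tuned thresholds would then be applied unchanged to the test set to obtain the final F2 score. Because ‘uncertain’ items are excluded, widening the band can raise the score while fewer items receive a definite label, so the proportion of labelled items is worth tracking alongside F2.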
Results: Classifiers were trained on six datasets, with final F2 scores ranging from 0.663 to 0.879 (Table 1). Classifications by study design and disease area outperformed those based on outcome measures or interventions.
Conclusions: The Fisher classification method was successful at classifying relevant and irrelevant articles based on their titles and abstracts, particularly for classifications based on study design and disease area. We intend to investigate the performance of other machine learning algorithms on this task.