An open competition involving thousands of competitors failed to construct useful search filters for new diagnostic test accuracy systematic reviews

Authors
Kataoka Y1, Taito S2, Yamamoto N3, So R4, Tsutsumi Y5, Anan K6, Banno M7, Tsujimoto Y8, Wada Y9, Sagami S10, Tsujimoto H11, Nihashi T12, Takeuchi M13, Terasawa T14, Iguchi M15, Kumasawa J16, Kasuga Y17, 999 R17, Yamabe J17, Furukawa TA18
1Kyoto Min-iren Asukai Hospital
2Division of Rehabilitation, Department of Clinical Practice and Support, Hiroshima University Hospital
3Department of Orthopedic Surgery, Miyamoto Orthopedic Hospital
4Department of Psychiatry, Okayama Psychiatric Medical Center
5Department of Emergency Medicine, National Hospital Organization Mito Medical Center
6Division of Respiratory Medicine, Saiseikai Kumamoto Hospital
7Department of Psychiatry, Seichiryo Hospital
8Oku Medical Clinic
9Department of Rehabilitation Medicine I, Fujita Health University School of Medicine
10Center for Advanced IBD Research and Treatment, Kitasato University Kitasato Institute Hospital
11Hospital Care Research Unit, Hyogo Prefectural Amagasaki General Medical Center
12Department of Radiology, National Center for Geriatrics and Gerontology
13Department of Emergency and General Internal Medicine, Fujita Health University School of Medicine
14Section of General Internal Medicine, Department of Emergency and General Internal Medicine, Fujita Health University School of Medicine
15Department of Neurology, Fukushima Medical University
16Department of Critical Care Medicine, Sakai City Medical Center
17Independent researcher
18Department of Health Promotion and Human Behavior, Kyoto University Graduate School of Medicine/School of Public Health
Abstract
Background:
No abstract classifier is currently available for new diagnostic test accuracy (DTA) systematic reviews to select primary DTA study abstracts from database search results.

Objectives:
Our goal with the FILtering of diagnostic Test accuracy studies (FILTER) Challenge was to develop machine learning (ML) filters for new DTA systematic reviews through an open competition.

Methods:
We conducted an open competition. We prepared a dataset of titles, abstracts, and the judgment on whether the full text should be retrieved, drawn from 10 DTA reviews and one mapping review. We randomly split the dataset into a training set (n = 27,145; labeled as DTA, n = 632), a public test set (n = 20,417; labeled as DTA, n = 474), and a private test set (n = 20,417; labeled as DTA, n = 469). Participants used the training set to develop models and validated them against the public test set to refine their development process. We then ranked the submitted models on the private test set. To evaluate and honor models, we preset the Fbeta score with beta = 7, which weights recall more heavily than precision and thus favors filters that are less likely to miss DTA studies; we also evaluated recall directly. For external validation, we used a DTA review in cardiology (n = 7,722; labeled as DTA, n = 167).
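Setting beta = 7 in the Fbeta score weights recall 49 times (beta squared) more heavily than precision, so a model that misses few DTA abstracts scores well even at low precision. A minimal sketch of the standard Fbeta formula from confusion counts (the function name and example counts here are illustrative, not the competition's actual scoring code):

```python
def fbeta_score(tp, fp, fn, beta=7.0):
    """F-beta from confusion counts; beta > 1 weights recall over precision.

    Fbeta = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A filter with perfect recall but only 50% precision:
# P = 0.5, R = 1.0 -> Fbeta(7) = 50 * 0.5 / (49 * 0.5 + 1.0) ≈ 0.980
print(fbeta_score(tp=50, fp=50, fn=0))
```

With beta = 1 the same counts give an F1 of only about 0.667, illustrating why a large beta was chosen for screening filters, where missed studies are far more costly than extra abstracts to read.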

Results:
We held the challenge from July 28 to October 4, 2021, and received a total of 13,774 submissions from 1,429 teams or individuals. We honored the top three models. In the external validation set, the Fbeta scores and recall were 0.4036 and 0.2352 for the first model, 0.3262 and 0.3313 for the second, and 0.3891 and 0.3976 for the third, respectively.

Conclusions:
We were unable to develop a search filter with recall sufficient for immediate application to new DTA reviews. Further studies are needed to update and validate the filters with datasets from other clinical areas.

Patient, public, and/or healthcare consumer involvement: None.