Semi-automated data extraction workbench for environmental health

Authors
Howard B1, Maharana A1, Tandon A1, Albert T1, Phillips J1, Taylor M2, Thayer K2, Shah R1
1Sciome LLC
2Environmental Protection Agency
Abstract
Background: Systematic review, already a cornerstone of evidence-based medicine, is gaining popularity in several related disciplines, including environmental health and evidence-based toxicology. A critical, time-consuming step in systematic review is the extraction of relevant qualitative and quantitative raw data from the text of scientific documents. The specific data extracted differ among disciplines, but within a given domain, certain data points are extracted repeatedly for each review that is conducted.

Methods: We have recently developed a semi-automated data extraction workbench for use in this context. Our research has focused on three specific goals. First, we are using deep learning to build novel data extraction components for items of interest within the domain of environmental health. Second, we have created web-based software specifically designed for extraction in the context of systematic review. Finally, we have introduced new protocols to standardize the inputs and outputs for data extraction software components.
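To make the third goal concrete, the following is a minimal sketch of what a standardized input/output contract for extraction components could look like. It is not the workbench's actual protocol, which this abstract does not specify. Every name here (`Extraction`, `ExtractionComponent`, `SexOfAnimalExtractor`, `run_components`) is hypothetical, and the regular-expression extractor is a stand-in for a trained deep learning model.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import List, Protocol


@dataclass
class Extraction:
    """One extracted data point (hypothetical output schema)."""
    field: str         # name of the data item, e.g. "sex"
    value: str         # normalized value
    start: int         # character offset where the mention begins
    end: int           # character offset where the mention ends
    confidence: float  # component's own confidence, 0.0 to 1.0


class ExtractionComponent(Protocol):
    """Assumed contract: plain text in, a list of Extraction records out."""
    name: str

    def extract(self, text: str) -> List[Extraction]: ...


class SexOfAnimalExtractor:
    """Toy rule-based component standing in for a learned model."""
    name = "sex_of_animal"
    _pattern = re.compile(r"\b(male|female)s?\b", re.IGNORECASE)

    def extract(self, text: str) -> List[Extraction]:
        return [
            Extraction("sex", m.group(1).lower(), m.start(), m.end(), 1.0)
            for m in self._pattern.finditer(text)
        ]


def run_components(components: List[ExtractionComponent], text: str) -> str:
    """Run every component on the same text and serialize results as JSON."""
    results = {c.name: [asdict(e) for e in c.extract(text)] for c in components}
    return json.dumps(results, indent=2)


if __name__ == "__main__":
    print(run_components([SexOfAnimalExtractor()], "Male and female rats were dosed."))
```

Under a contract of this kind, a component from another research group would only need to provide a `name` and an `extract` method returning the shared record type; the host application would not need to know how the component works internally.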

Results: A beta version, currently under evaluation at EPA, includes more than 30 novel data extraction components relevant to environmental toxicology. Performance varies widely among data types, with some tasks inherently more difficult than others. For certain simple data items, such as the sex of the experimental animal, we achieve F-scores in excess of 95%; for more difficult entities, we can often still achieve an F-score of 65% or more, given sufficient training data. Importantly, the design of our workbench makes it easy to include extraction components developed by other research groups. The workbench currently includes several such components, with new ones added regularly.
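For readers unfamiliar with the metric, the sketch below shows how an F-score is computed from a set of predicted items and a set of gold-standard items. It assumes exact-match scoring and the balanced F1 variant (`beta=1.0`); the abstract does not state which variant or matching criterion was used, and the function name and sample data are illustrative only.

```python
def f_score(predicted, gold, beta=1.0):
    """F-beta score over two collections of hashable items (exact match)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # fraction of predictions that are correct
    recall = true_positives / len(gold)          # fraction of gold items that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Made-up (document, value) pairs, for illustration only.
    predicted = {("doc1", "male"), ("doc2", "female"), ("doc3", "male")}
    gold = {("doc1", "male"), ("doc2", "female"), ("doc4", "female")}
    print(round(f_score(predicted, gold), 3))
```

In this toy case two of three predictions are correct and two of three gold items are found, so precision and recall are both 2/3 and the F1 score is also 2/3.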

Conclusions: Because accurate data extraction is a challenging problem, and given that current methods rarely achieve 100% accuracy, we are integrating our methods into a “human-in-the-loop” system that combines machine and human intelligence in a manner that is superior to using either in isolation. The system will: highlight extracted terms in a PDF; automatically populate forms with extracted data; allow humans to intervene and correct the results; and learn from the corrections to continually update the model. The resulting system will make systematic reviews both more efficient to produce and less expensive to maintain, greatly accelerating the process by which scientific consensus is obtained in a variety of health-related disciplines.
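The populate, correct, and learn steps of such a loop can be sketched as follows. This is an illustrative assumption, not the system's implementation: `ReviewSession` and its methods are hypothetical names, and because the abstract does not describe how the model is updated, the sketch only records corrections so that they could later be used as training examples.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ReviewSession:
    """Holds one extraction form plus a log of human corrections."""
    form: Dict[str, str] = field(default_factory=dict)
    corrections: List[Tuple[str, str, str]] = field(default_factory=list)

    def populate(self, predictions: Dict[str, str]) -> None:
        """Pre-fill the form with machine-extracted values."""
        self.form.update(predictions)

    def correct(self, name: str, new_value: str) -> None:
        """Apply a reviewer's correction and log (field, old, new) if it changed."""
        old_value = self.form.get(name, "")
        if new_value != old_value:
            self.corrections.append((name, old_value, new_value))
            self.form[name] = new_value

    def training_examples(self) -> List[Tuple[str, str]]:
        """Corrected (field, value) pairs that a later retraining step could use."""
        return [(name, new) for name, _, new in self.corrections]


if __name__ == "__main__":
    session = ReviewSession()
    session.populate({"sex": "male", "species": "mouse"})
    session.correct("species", "rat")  # reviewer overrides the prediction
    print(session.form)
    print(session.training_examples())
```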

Patient or healthcare consumer involvement: The resulting system will make systematic reviews both more efficient to produce and less expensive to maintain, greatly accelerating the process by which scientific consensus is obtained in a variety of health-related disciplines.