Semi-automated data extraction workbench for environmental health

2020 Abstracts

Howard B¹, Maharana A¹, Tandon A¹, Albert T¹, Phillips J¹, Taylor M², Thayer K², Shah R¹

¹Sciome LLC

²Environmental Protection Agency

Background:

Systematic review, already a cornerstone of evidence-based medicine, has begun to gain significant popularity in several related disciplines, including environmental health and evidence-based toxicology. A critical, time-consuming step in systematic review is the extraction of relevant qualitative and quantitative raw data from the text of scientific documents. The specific data extracted differ among disciplines, but within a given domain, certain data points are extracted repeatedly for each review that is conducted.

Methods:

We have recently developed a semi-automated data extraction workbench for use in this context. Our research has focused on three specific goals. First, we are using deep learning to build novel data extraction components for items of interest within the domain of environmental health. Second, we have created web-based software specifically designed for extraction in the context of systematic review. Finally, we have introduced new protocols to standardize the inputs and outputs for data extraction software components.
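To illustrate what standardized inputs and outputs for extraction components could look like, the following is a minimal sketch. It does not reproduce the workbench's actual protocol: the names `Extraction`, `Extractor`, `SexExtractor`, and `run_all` are hypothetical, and a toy regular expression stands in for a deep learning model.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Extraction:
    """One extracted data point (hypothetical output format)."""
    item: str    # name of the data item, e.g. "sex"
    value: str   # normalized value found in the text
    start: int   # character offset where the mention begins
    end: int     # character offset where the mention ends


class Extractor(Protocol):
    """Assumed shared interface: plain text in, list of Extraction out."""
    def extract(self, text: str) -> list[Extraction]: ...


class SexExtractor:
    """Toy rule-based component standing in for a learned model."""
    _pattern = re.compile(r"\b(male|female)\b", re.IGNORECASE)

    def extract(self, text: str) -> list[Extraction]:
        return [
            Extraction("sex", m.group(1).lower(), m.start(1), m.end(1))
            for m in self._pattern.finditer(text)
        ]


def run_all(components: list[Extractor], text: str) -> list[Extraction]:
    """Run every component on the same text and pool the results."""
    results: list[Extraction] = []
    for component in components:
        results.extend(component.extract(text))
    return results
```

Because every component accepts the same input and returns the same record type, a component written by another group could be added to the list passed to `run_all` without changes to the calling code.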

Results:

A beta version, currently under evaluation at EPA, includes more than 30 novel data extraction components relevant to environmental toxicology. Performance varies widely among data types, with some tasks inherently more difficult than others. For certain simple data items, such as the sex of the experimental animal, we achieve F-scores in excess of 95%; for more difficult entities, we often still achieve an F-score of 65% or more, given sufficient training data. Importantly, the design of our workbench makes it easy to include extraction components developed by other research groups. The workbench currently includes several such components, with new ones added regularly.
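For readers unfamiliar with the metric, the sketch below shows how an F-score can be computed for one data item. It assumes the balanced F1 variant (the harmonic mean of precision and recall) and exact matching of extracted values; the abstract does not state which variant or matching rule was used, and the function name `f_score` is illustrative.

```python
def f_score(predicted, gold):
    """Balanced F1 between predicted and gold-standard extractions.

    Assumes exact-match comparison of values treated as sets.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

With two of three predictions correct and two of three gold items found, precision and recall are both 2/3, so the F-score is also 2/3.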

Conclusions:

Because accurate data extraction is a challenging problem, and given that current methods rarely achieve 100% accuracy, we are integrating our methods into a “human-in-the-loop” system that combines machine and human intelligence in a manner that is superior to using either in isolation. The system will highlight extracted terms in a PDF; automatically populate forms with extracted data; allow humans to intervene and correct the results; and learn from the corrections to continually update the model. The resulting system will make systematic reviews both more efficient to produce and less expensive to maintain, greatly accelerating the process by which scientific consensus is obtained in a variety of health-related disciplines.
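The correction step of such a human-in-the-loop workflow might be sketched as follows. This is an illustration under stated assumptions, not the workbench's implementation: the class `ReviewSession` and its method names are hypothetical, and the retraining step is represented only by a log of corrections that a later model update could consume.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewSession:
    """Collects human corrections to machine-proposed form values."""
    corrections: list = field(default_factory=list)

    def review(self, item, proposed, human_value=None):
        """Return the final value for a form field.

        If the reviewer supplies a different value, it overrides the
        machine proposal and the disagreement is logged.
        """
        if human_value is None or human_value == proposed:
            return proposed  # reviewer accepted the machine output
        self.corrections.append((item, proposed, human_value))
        return human_value

    def training_examples(self):
        """Corrections as (item, corrected value) pairs for a future update."""
        return [(item, corrected) for item, _, corrected in self.corrections]
```

Only the overridden values are logged here, on the assumption that disagreements are the most informative signal for updating a model; a real system might also record accepted values.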

Patient or healthcare consumer involvement:

The resulting system will make systematic reviews both more efficient to produce and less expensive to maintain, greatly accelerating the process by which scientific consensus is obtained in a variety of health-related disciplines.