The automatic identification of empirical research articles

Article type
Year
Authors
Haas SW, Sugarman J, Tibbo HR
Abstract
Introduction: Retrieving documents from electronic databases on dimensions other than topic can be problematic, yet for many users this ability may be crucial to finding the documents they seek. For instance, distinguishing the empirical research literature from other types of literature on a particular topic may be important to users, but is difficult, if not impossible, in current databases. Full-text electronic databases open the possibility of using text-filtering strategies to make such distinctions.

Objective: To develop a method of automatically identifying articles reporting empirical research in the multidisciplinary field of bioethics, where a variety of empirical and theoretical methods are used to investigate particular issues.

Methods: Several electronic databases (Ageline, Bioethicsline, Health Planning and Administration, Medline, Philosophers' Index, and PsycINFO) were searched to identify literature on the topic of "advance directives". The bioethicist on our team (JS) selected 30 articles from this search to represent a variety of authors and journals, comprising 15 articles describing empirical research and 15 non-empirical articles. The full text of each article was converted to electronic format and examined to determine whether a rule based on vocabulary could classify the articles as empirical or non-empirical.
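
The vocabulary examination described above can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the tokenization (lowercased alphabetic words), and the `min_gap` threshold are all assumptions made for illustration. It lists words whose share of articles differs sharply between the two sets.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of distinct lowercase alphabetic words in a text (assumed tokenization)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def document_frequency(articles):
    """Count, for each word, how many articles contain it at least once."""
    counts = Counter()
    for text in articles:
        counts.update(tokenize(text))
    return counts


def discriminating_words(empirical, non_empirical, min_gap=0.8):
    """List words whose share of articles differs between the sets by at least min_gap.

    min_gap is a hypothetical threshold; 1.0 keeps only words that appear in
    every article of one set and in no article of the other.
    """
    df_emp = document_frequency(empirical)
    df_non = document_frequency(non_empirical)
    rows = []
    for word in set(df_emp) | set(df_non):
        share_emp = df_emp[word] / len(empirical)
        share_non = df_non[word] / len(non_empirical)
        if abs(share_emp - share_non) >= min_gap:
            rows.append((word, share_emp, share_non))
    # Largest gap first, then alphabetical for a stable order.
    return sorted(rows, key=lambda r: (-abs(r[1] - r[2]), r[0]))
```

Each returned tuple holds a word, the fraction of empirical articles containing it, and the fraction of non-empirical articles containing it; candidate cue words are those near (1.0, 0.0).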

Results: The articles' vocabulary was analyzed to find words that appeared exclusively or predominantly in one set and not in the other. A simple rule based on the presence of the words "table" or "sample" successfully differentiated the empirical from the non-empirical articles. This result is important for three reasons: (1) the rule was 100% accurate on the 30 articles in this study; (2) it can be applied in existing full-text databases; (3) because the words refer not to the specific content of an article but to its genre, this type of rule is likely to be effective in other disciplines.
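
A rule of this kind might be sketched as follows. The abstract does not specify how the words were matched, so the case-insensitive, whole-word matching here is an assumption, as are the names `CUE_WORDS` and `is_empirical`. In this sketch inflected forms such as "tables" do not count as matches.

```python
import re

# Cue words reported in the abstract; the matching details below are assumptions.
CUE_WORDS = ("table", "sample")

# Whole-word, case-insensitive match on any cue word.
_CUE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in CUE_WORDS) + r")\b",
    re.IGNORECASE,
)


def is_empirical(full_text):
    """Classify an article as empirical if its full text contains a cue word."""
    return bool(_CUE_PATTERN.search(full_text))
```

Whole-word matching keeps words such as "notable" or "stable" from triggering the rule; whether the original study made that distinction is not stated.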

Discussion: This study demonstrates that simple decision rules based on vocabulary can automatically classify articles as empirical or non-empirical. A similar approach might conceivably be used to identify articles by the particular type of empirical method employed, such as the randomized clinical trial.