Investigating the modelling of large routine datasets in primary care

Authors
Willis B1, Nirantharakumar K1, Toulis K1, Trikalinos T2
1University of Birmingham
2Brown University
Abstract
Background:
Epidemiological studies using large primary care datasets are becoming increasingly common. The open cohort design is often used in such analyses, with the exposed frequently matched to controls. This raises several questions, including whether matching is necessary, whether covariate adjustment is required after matching and which regression model(s) should be used.

Objective:
To investigate the effectiveness of matching, covariate adjustment and different regression models for modelling large routine datasets from primary care.

Methods:
Using parameters taken from an example in the literature, we simulated an open cohort study by generating survival times from a Weibull distribution with four covariates (two Bernoulli and two continuous), for both a dispersed and an undispersed distribution. We modelled different combinations of the covariates in terms of association and functional form. For population sizes ranging from 10,000 to 1,000,000 and an exposure with a prevalence of 1%, we considered both matched and randomly selected controls. We analysed the simulated data using Cox regression, Poisson regression and conditional logistic regression, and compared the estimates of the exposure parameter from each model in terms of bias, mean square error and coverage probability.
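
As an illustration only, the simulation and evaluation steps might be sketched as follows. This is a minimal sketch, not the code used in the study: it assumes a proportional-hazards Weibull model inverted by the standard inverse-transform method, it treats the exposure as a separate indicator alongside the four covariates, and the function names, shape, scale and coefficient values are all hypothetical placeholders.

```python
import numpy as np


def weibull_times(u, lin_pred, shape, scale):
    """Invert a Weibull proportional-hazards survival function.

    Assumes cumulative hazard H(t | x) = scale * t**shape * exp(lin_pred),
    so solving S(t | x) = u gives
    t = (-log(u) / (scale * exp(lin_pred))) ** (1 / shape).
    """
    return (-np.log(u) / (scale * np.exp(lin_pred))) ** (1.0 / shape)


def simulate_cohort(n, log_hr_exposure, rng, shape=1.2, scale=0.01):
    """Simulate exposure, four covariates and survival times.

    All prevalences and coefficients below are illustrative assumptions,
    except the 1% exposure prevalence, which follows the text.
    """
    exposed = rng.binomial(1, 0.01, size=n)   # exposure, 1% prevalence
    b1 = rng.binomial(1, 0.5, size=n)         # Bernoulli covariate (assumed p)
    b2 = rng.binomial(1, 0.3, size=n)         # Bernoulli covariate (assumed p)
    c1 = rng.normal(size=n)                   # continuous covariate
    c2 = rng.normal(size=n)                   # continuous covariate
    lin_pred = (log_hr_exposure * exposed
                + 0.3 * b1 - 0.2 * b2 + 0.1 * c1 + 0.1 * c2)
    u = 1.0 - rng.random(n)                   # uniform on (0, 1], avoids log(0)
    times = weibull_times(u, lin_pred, shape, scale)
    covariates = np.column_stack([exposed, b1, b2, c1, c2])
    return covariates, times


def performance(estimates, std_errors, true_value, z=1.96):
    """Bias, mean square error and coverage of nominal 95% Wald intervals."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - true_value
    mse = np.mean((estimates - true_value) ** 2)
    covered = np.abs(estimates - true_value) <= z * std_errors
    return bias, mse, covered.mean()


rng = np.random.default_rng(2024)
covariates, times = simulate_cohort(10_000, np.log(1.5), rng)
```

In a full simulation, each replicate would be passed to the chosen regression model, and the exposure estimates and their standard errors collected across replicates would then be summarised with `performance`.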

Results:
Poisson regression produced the most biased estimates in both the dispersed and undispersed populations. In many cases, selecting matched controls confers no advantage over selecting controls at random, provided that covariates are adjusted for. However, when the functional form of one or more covariates is unknown and those covariates are associated with the exposure, matched samples produced the least biased estimates. Overall, conditional logistic regression and Cox regression applied to matched sample data produced the least biased estimates.

Conclusions:
Conditional logistic regression and Cox regression, each with adjustment for the matched covariates, are recommended in the majority of cases. As the population becomes more dispersed, a greater number of exposed individuals is required to reduce bias.