Data management in real-world study

Authors
Li J1, Han M1, Shen C1, Yang M1, Sun J2, Liu J1
1Beijing University of Chinese Medicine
2Nanjing University of Chinese Medicine
Abstract
Background: clinical trials still have many limitations: they are slow and costly, their external validity is limited, and it is difficult for patients to participate in them. Real-world data can help to compensate for these deficiencies and so promote the generation of clinical evidence. However, because real-world data are so diverse, researchers spend more than 80% of their time on data management throughout a study. This research aims to explore the data management process by reviewing articles and summarizing project management experience.

1) Data management process
- General assessment of data (data integrity): we use a case report form (CRF) to determine whether any variables are missing. Original medical records (unstructured data) need to be assessed by keyword searching. After the assessment, we send integrity reports to data providers so that they can supplement the data.
- Writing criteria: the writing criteria change over time and vary between hospitals, as do the writing styles of medical cases. These differences should be recorded promptly so that the specific rules can be adjusted for other hospitals.
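
The integrity assessment described above can be sketched as follows. This is only a minimal illustration: the variable names in `CRF_VARIABLES`, the record layout and the helper functions are hypothetical and are not taken from the study.

```python
from collections import Counter

# Hypothetical CRF variable list; a real CRF would define many more fields.
CRF_VARIABLES = ["patient_id", "age", "sex", "diagnosis", "admission_date"]


def missing_variables(record, crf_variables=CRF_VARIABLES):
    """Return the CRF variables that are absent or empty in one record."""
    return [v for v in crf_variables if record.get(v) in (None, "")]


def integrity_report(records, crf_variables=CRF_VARIABLES):
    """Count, for each CRF variable, how many records lack a value."""
    counts = Counter()
    for record in records:
        counts.update(missing_variables(record, crf_variables))
    return {v: counts.get(v, 0) for v in crf_variables}


# Made-up records, for illustration only.
records = [
    {"patient_id": "P001", "age": 63, "sex": "F",
     "diagnosis": "stroke", "admission_date": "2018-03-02"},
    {"patient_id": "P002", "age": None, "sex": "M",
     "diagnosis": "", "admission_date": "2018-04-11"},
]
report = integrity_report(records)
```

A per-variable count of this kind is one possible basis for the integrity report sent back to data providers.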

2) Data screening and merging
- Data screening: this step removes data that do not meet the inclusion criteria. It can be implemented with both machine and manual methods. First, the knowledge repository (dictionary) on which the screening process relies should be established at the very beginning, built according to the diseases studied. The unstructured data are then marked using a dictionary-based matching method from natural language processing. Second, according to the judgment criteria, we establish a checklist of keywords to screen the data.
- Data merging: following the merge rules used by the different hospitals, the multiple pieces of information for each individual case should be merged into one package. Data screening and merging reports are sent to data providers in order to reach a consensus on the data included in the study.
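
A rough sketch of dictionary-based screening followed by merging is given below. The dictionary entries, the inclusion and exclusion keywords, and the merge rule (grouping by a `patient_id` field) are all assumptions made for illustration; actual rules differ between diseases and hospitals. Simple substring matching is used here and does not handle negation ("no stroke"), which a real pipeline would need to address.

```python
from collections import defaultdict

# Hypothetical dictionary for one disease: concept -> terms seen in records.
DICTIONARY = {
    "ischemic_stroke": ["ischemic stroke", "cerebral infarction"],
    "hemorrhage": ["intracerebral hemorrhage", "cerebral hemorrhage"],
}
REQUIRED = {"ischemic_stroke"}  # hypothetical inclusion checklist
EXCLUDED = {"hemorrhage"}       # hypothetical exclusion checklist


def mark_concepts(text, dictionary=DICTIONARY):
    """Mark a free-text record with every concept whose term occurs in it."""
    lowered = text.lower()
    return {concept for concept, terms in dictionary.items()
            if any(term in lowered for term in terms)}


def meets_criteria(text):
    """Keep a record only if all required concepts and no excluded ones appear."""
    concepts = mark_concepts(text)
    return REQUIRED <= concepts and not (EXCLUDED & concepts)


def merge_by_patient(records, key="patient_id"):
    """Merge the texts of one case into a single package (assumed rule)."""
    merged = defaultdict(list)
    for record in records:
        merged[record[key]].append(record["text"])
    return dict(merged)


# Made-up records, for illustration only.
records = [
    {"patient_id": "P001", "text": "Admitted with cerebral infarction."},
    {"patient_id": "P001", "text": "Follow-up visit: ischemic stroke, stable."},
    {"patient_id": "P002", "text": "CT shows cerebral hemorrhage."},
    {"patient_id": "P003", "text": "Routine check."},
]
screened = [r for r in records if meets_criteria(r["text"])]
packages = merge_by_patient(screened)
```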

3) Data extraction
We extract variables from the original medical records using natural language processing and a knowledge map (Figure 1).
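
As a deliberately simplified stand-in for the extraction step, the sketch below uses regular expressions rather than a full natural language processing pipeline or knowledge map. The variable names, patterns and example note are hypothetical.

```python
import re

# Hypothetical patterns; each captures one numeric variable from free text.
PATTERNS = {
    "age": re.compile(r"(\d{1,3})[- ]year[- ]old", re.IGNORECASE),
    "systolic_bp": re.compile(
        r"blood pressure\s*(\d{2,3})\s*/\s*\d{2,3}", re.IGNORECASE),
    "diastolic_bp": re.compile(
        r"blood pressure\s*\d{2,3}\s*/\s*(\d{2,3})", re.IGNORECASE),
}


def extract_variables(text, patterns=PATTERNS):
    """Return one value per variable, or None when no pattern matches."""
    result = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        result[name] = int(match.group(1)) if match else None
    return result


# Made-up note, for illustration only.
note = "A 63-year-old woman. Blood pressure 150/95 mmHg on admission."
extracted = extract_variables(note)
```

Variables that cannot be found are returned as `None`, so that they can be flagged during verification rather than silently dropped.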

4) Verification of structured variables
- Researchers sample about 5% of cases and compare the extracted variables with the original medical records.
- Every variable needs to be verified across all cases in order to discover and summarize problems. A dual check should be performed by data extraction staff and other personnel.
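
The sampling and comparison steps above could be sketched as follows. The function names, the fixed random seed and the dictionary layout are assumptions; the reference values stand for what a reviewer reads from the original medical record.

```python
import random


def sample_cases(case_ids, fraction=0.05, seed=0):
    """Draw a reproducible random sample of about `fraction` of the cases."""
    k = max(1, round(len(case_ids) * fraction))
    return random.Random(seed).sample(sorted(case_ids), k)


def discrepancies(extracted, reference, case_ids):
    """List (case, variable, extracted value, reference value) mismatches."""
    problems = []
    for case_id in case_ids:
        for variable, ref_value in reference[case_id].items():
            value = extracted[case_id].get(variable)
            if value != ref_value:
                problems.append((case_id, variable, value, ref_value))
    return problems


# Made-up data, for illustration only.
all_cases = [f"P{i:03d}" for i in range(1, 101)]
sampled = sample_cases(all_cases)

extracted = {"P001": {"age": 63, "sex": "female"},
             "P002": {"age": 58, "sex": "male"}}
reference = {"P001": {"age": 63, "sex": "female"},
             "P002": {"age": 85, "sex": "male"}}
problems = discrepancies(extracted, reference, ["P001", "P002"])
```

Fixing the seed makes the sample reproducible, which may help when two reviewers perform the dual check on the same cases.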

5) Data classification
We extract the values of each variable. Values that fall outside the range specified in the CRF are discussed, classified and then re-assigned.
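
A minimal sketch of this classification step is shown below. The allowed CRF codes and the re-assignment rules are hypothetical examples; in practice they would come from the CRF and from the discussion among researchers.

```python
# Hypothetical CRF coding for one variable.
CRF_ALLOWED = {"sex": {"male", "female"}}

# Hypothetical re-assignment rules agreed after discussion.
RECODE = {"sex": {"m": "male", "man": "male", "f": "female", "woman": "female"}}


def classify_value(variable, raw):
    """Return (value, status) for one raw value of a CRF variable.

    status is "in_range" when the value is already a CRF code,
    "reassigned" when a recoding rule applies, and
    "needs_discussion" when neither is the case.
    """
    value = str(raw).strip().lower()
    if value in CRF_ALLOWED[variable]:
        return value, "in_range"
    if value in RECODE.get(variable, {}):
        return RECODE[variable][value], "reassigned"
    return raw, "needs_discussion"


# Made-up raw values, for illustration only.
results = [classify_value("sex", raw) for raw in ["Male", " F ", "unknown"]]
```

Values marked `needs_discussion` are left unchanged so that they can be reviewed before a new rule is added.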

Patient or healthcare consumer involvement: real-world evidence incorporates as much patient data as possible. More and more patients will benefit when all these data are used to improve clinical decision-making and meet patients' needs.