As our community increases the pace of implementing mCODE-based applications for a growing number of use cases, CodeX aims to make test datasets available to support building, testing, demonstrating, piloting, and adopting these applications.
As a first set of resources, the Synthea™ patient health record simulator and associated tools have been used to generate synthetic cancer patient mCODE records. We are also looking into providing properly de-identified records in mCODE format.
Anyone is welcome to leverage the resources below. Please let us know what you think and what else we can do to accelerate your efforts.
Synthea Synthetic Data for Testing mCODE-based Applications
The following datasets were produced on March 6, 2020, using Synthea™, an open-source patient population simulator made available by The MITRE Corporation:
Approx. 200 Lifetime/Longitudinal Patient Records: https://mcodeapp.org/testdata/mcode1_0_longitudinal.zip
- Female breast cancer patients: 179
- Male breast cancer patients: 14
- Assorted other cancer (lung, colorectal, prostate) patients: 6
- 82 Mbytes, compressed
Approx. 2,000 Patient Records with 10 Years of Medical History: https://mcodeapp.org/testdata/mcode1_0_10yrs.zip
- Female breast cancer patients: 1,853
- Male breast cancer patients: 19
- Assorted other cancer patients: 211
- 215 Mbytes, compressed
The following datasets were produced using Synthea on December 19, 2020. They contain patients who have been diagnosed with diffuse large B-cell lymphoma (DLBCL):
Approx. 400 Lifetime/Longitudinal Patient Records: https://mcodeapp.org/testdata/mcode1_0_dlbcl_longitudinal.zip
- Female DLBCL Patients: 152
- Male DLBCL Patients: 276
- 185 Mbytes, compressed
Approx. 4,000 Patient Records with 10 Years of Medical History: https://mcodeapp.org/testdata/mcode1_0_dlbcl_10yrs.zip
- Female DLBCL Patients: 1,660
- Male DLBCL Patients: 2,458
- 510 Mbytes, compressed
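As a rough way to sanity-check a downloaded archive against the patient counts listed above, the sketch below tallies patients by gender. It assumes that each archive contains one JSON FHIR Bundle per patient and that each Bundle carries a Patient resource with a gender field; the archive layout is not documented on this page, so treat both as assumptions. The function name count_patients_by_gender is illustrative, not part of any published tool.

```python
import json
import zipfile
from collections import Counter


def count_patients_by_gender(zip_source):
    """Tally Patient.gender across every JSON FHIR Bundle in a zip archive.

    zip_source may be a file path or a binary file-like object.
    Assumption: each .json member is a FHIR Bundle with an "entry" list.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue  # skip directories and non-JSON members
            bundle = json.loads(archive.read(name))
            if not isinstance(bundle, dict):
                continue  # not a Bundle-shaped document
            for entry in bundle.get("entry", []):
                resource = entry.get("resource", {})
                if resource.get("resourceType") == "Patient":
                    counts[resource.get("gender", "unknown")] += 1
    return counts
```

For example, after downloading one of the archives, calling count_patients_by_gender("mcode1_0_dlbcl_longitudinal.zip") should give totals that can be compared with the figures listed for that dataset. Members without a Patient resource are simply ignored.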
The datasets are free of cost, privacy, and security restrictions. They can be used without restriction for a variety of secondary uses in academia, research, industry, and government. Even though these synthetic patients are intended to reflect mCODE, each one contains the patient's entire record, including non-cancer-related encounters, conditions, medications, etc.
Because of the way Synthea outputs FHIR records, it is not currently possible to generate mCODE patients directly from Synthea, so these patients have been post-processed using the fhir-mapper.
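One way to see which mCODE profiles a post-processed record actually claims is to tally the profile URLs declared on its resources. The sketch below is a minimal illustration under two assumptions that this page does not confirm: that the post-processed resources list their profiles in meta.profile, and that mCODE profile URLs begin with the prefix held in MCODE_PREFIX. The function name tally_mcode_profiles is hypothetical.

```python
from collections import Counter

# Assumption: mCODE profile URLs share this canonical prefix.
MCODE_PREFIX = "http://hl7.org/fhir/us/mcode/"


def tally_mcode_profiles(bundle):
    """Count how often each mCODE profile URL is declared in a FHIR Bundle.

    bundle is a parsed FHIR Bundle (a dict). Resources that declare no
    profile, or only non-mCODE profiles, contribute nothing to the tally.
    """
    counts = Counter()
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        for profile in resource.get("meta", {}).get("profile", []):
            if profile.startswith(MCODE_PREFIX):
                counts[profile] += 1
    return counts
```

Because each record holds the patient's full history, most resources in a bundle would be expected to carry no mCODE profile at all; the tally only surfaces the ones that do.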
Note: Synthea generates synthetic patients based on modules representing the progression and treatment of diseases and other conditions. This means that the fidelity and variance of each patient's journey, as well as the data elements captured, are limited by those modules. The breast cancer module within Synthea is one of the most advanced. However, the program still cannot capture the full complexity of a condition like breast cancer. Notable concepts in mCODE that are not yet represented include pathologic staging, genetics/genomics, and metastasis.
Most importantly, this data cannot be used for clinical discovery.