Personalized medicine divides patients into increasingly fine categories—clinical phenotypes—to enable tailored diagnosis and treatment.
Common Data Models standardize healthcare data so multiple institutions can combine and analyze it for research on specific diseases, treatments, outcomes, and more.
Healthcare systems store and label data differently. The same concept can be named in different ways even within a single organization, and certainly across countries. Without standardization, queries are inconsistent and data integration is hard.
CDMs define common structures and fields so the same kinds of data live in the same place, regardless of source system.
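As a rough illustration of that idea (not OMOP itself), the sketch below reshapes records from two hypothetical source systems into one shared layout. The system names, field names, date formats, and the CommonVisit type are all invented for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CommonVisit:
    """Hypothetical shared layout; not an actual CDM table."""
    person_id: int
    start_date: date
    source_system: str


def from_hospital_a(row: dict) -> CommonVisit:
    # Hypothetical system A: patient key is "patient_no", dates are ISO strings.
    return CommonVisit(int(row["patient_no"]), date.fromisoformat(row["admit_dt"]), "A")


def from_hospital_b(row: dict) -> CommonVisit:
    # Hypothetical system B: patient key is "pid", dates are day/month/year.
    day, month, year = (int(part) for part in row["entree"].split("/"))
    return CommonVisit(int(row["pid"]), date(year, month, day), "B")


visits = [
    from_hospital_a({"patient_no": "17", "admit_dt": "2024-03-05"}),
    from_hospital_b({"pid": "42", "entree": "05/03/2024"}),
]

# Once both sources share a layout, a single filter works for all of them.
same_day = [v.person_id for v in visits if v.start_date == date(2024, 3, 5)]
print(same_day)
```

The point of the sketch is only that the per-source quirks are handled once, at load time, so every later query can assume the same fields in the same place.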
Widely used open-source research CDMs include:
There is no single “best” CDM—each has strengths and trade-offs depending on your use case, data shape, and query patterns.
OMOP is the CDM I focus on here because I work with it regularly. It has an active community and solid documentation.
OMOP = Observational Medical Outcomes Partnership.
It is maintained by the OHDSI community (Observational Health Data Sciences and Informatics), whose coordinating center is at Columbia University, New York.
OMOP has broad global adoption. Estimates suggest data for roughly 1.4 billion individuals have been mapped to OMOP worldwide.
The OMOP CDM is primarily relational.
Terminology differences across languages and sources are handled via standardization. For example, ICD‑10 codes (English) and their CIM‑10 counterparts (French) are both mapped to a common standard such as SNOMED CT.
DDL scripts to create the model are available in the OHDSI/CommonDataModel GitHub repository. These SQL scripts create the standard tables of the OMOP CDM.
OMOP distinguishes two sides: source and standard.
- Source: fields marked with source_ (e.g., source_value, source_concept_id)
- Standard: the corresponding standardized fields (e.g., concept_id)

All source terms should be mapped to standard concepts. This allows multiple data sources to converge on a shared, community‑understood vocabulary.
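A minimal sketch of what such a mapping lookup can look like, using an in-memory SQLite database. It goes slightly beyond the text above: it assumes the OMOP CONCEPT_RELATIONSHIP table with its 'Maps to' relationship and the standard_concept = 'S' flag, and it keeps only a subset of the real columns. Every concept ID, code, name, and vocabulary label in the sample data is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id       INTEGER PRIMARY KEY,
    concept_name     TEXT,
    vocabulary_id    TEXT,
    concept_code     TEXT,
    standard_concept TEXT   -- 'S' marks a standard concept
);
CREATE TABLE concept_relationship (
    concept_id_1    INTEGER,
    concept_id_2    INTEGER,
    relationship_id TEXT
);
-- All IDs, codes, and names below are placeholders, not real vocabulary content.
INSERT INTO concept VALUES
    (9001, 'Example condition (ICD-10 wording)', 'ICD10',  'X00.0',  NULL),
    (9002, 'Example condition (CIM-10 wording)', 'CIM10',  'X00.0',  NULL),
    (1001, 'Example condition',                  'SNOMED', '000000', 'S');
INSERT INTO concept_relationship VALUES
    (9001, 1001, 'Maps to'),
    (9002, 1001, 'Maps to');
""")


def to_standard(source_concept_id: int) -> list[int]:
    """Return the standard concept IDs that a source concept maps to."""
    rows = conn.execute(
        """
        SELECT cr.concept_id_2
        FROM concept_relationship AS cr
        JOIN concept AS c ON c.concept_id = cr.concept_id_2
        WHERE cr.concept_id_1 = ?
          AND cr.relationship_id = 'Maps to'
          AND c.standard_concept = 'S'
        ORDER BY cr.concept_id_2
        """,
        (source_concept_id,),
    ).fetchall()
    return [row[0] for row in rows]


# Both source-side concepts converge on the same standard concept.
print(to_standard(9001), to_standard(9002))
```

In practice the vocabulary tables would be loaded from an Athena download rather than typed in by hand; the query shape is what the sketch is meant to show.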
You can search and download standardized vocabularies and mappings in Athena (official OHDSI vocabulary portal): https://athena.ohdsi.org/search-terms/start
Hierarchical terminologies enable powerful roll‑ups and drill‑downs.
Example hierarchies:
For medications, OMOP uses RxNorm as the primary standard, provided by the U.S. National Library of Medicine.
OMOP also includes additional medication hierarchies. One example is NDF‑RT (National Drug File – Reference Terminology), which organizes medications by related diseases/indications.
- SOURCE_VALUE: original, unmodified values (useful for debugging; zero loss)
- CONCEPT_ID: standardized OMOP concept identifiers (enables local ↔ network harmonization)

Key concepts:
- SOURCE_VALUE / SOURCE_CONCEPT_ID: original source values (e.g., from MIMIC‑III). These preserve what was in the upstream system.
- CONCEPT_ID: standardized OMOP concept identifiers. Use these for standardized, cross‑site queries.
- CONCEPT table: the vocabulary dictionary that maps concept_id ↔ human‑readable names and related metadata.
- CONCEPT_ANCESTOR table: stores concept hierarchies, linking a higher‑level (parent/ancestor) concept to all of its descendant concepts for roll‑up queries.

Example hierarchy:
OMOP uses broad, standardized table and field names so it can generalize across many health systems worldwide. A few examples:
- VISIT_OCCURRENCE = encounters/admissions/visits (table spec: CDM 5.4 VISIT_OCCURRENCE)
- CONDITION_OCCURRENCE = diagnoses/conditions (table spec: CDM 5.4 CONDITION_OCCURRENCE)
- DRUG_EXPOSURE = medication prescribing/dispensing/administration, depending on the source (table spec: CDM 5.4 DRUG_EXPOSURE)

These umbrella names intentionally cover many similar events (e.g., external consults, urgent care, inpatient stays) under a single “visit” construct. The same idea applies to other domains.
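To make the umbrella idea concrete, here is a small sketch that counts recorded conditions per visit type with one query, whatever kind of visit each row represents. It uses an in-memory SQLite database with only a handful of the CDM 5.4 columns; the concept IDs (501 and 502 for two visit types, 1001 and 1002 for two conditions) and all rows are placeholders made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    person_id           INTEGER,
    visit_concept_id    INTEGER,
    visit_start_date    TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id               INTEGER,
    condition_concept_id    INTEGER,
    condition_start_date    TEXT,
    visit_occurrence_id     INTEGER
);
-- Placeholder data: every ID here is invented.
INSERT INTO visit_occurrence VALUES
    (1, 10, 501, '2024-01-02'),
    (2, 10, 502, '2024-02-10'),
    (3, 11, 501, '2024-03-15');
INSERT INTO condition_occurrence VALUES
    (1, 10, 1001, '2024-01-02', 1),
    (2, 10, 1002, '2024-02-10', 2),
    (3, 11, 1001, '2024-03-15', 3);
""")

# All visit kinds live in one table, so a single join covers them all.
rows = conn.execute("""
    SELECT v.visit_concept_id, COUNT(*) AS n_conditions
    FROM condition_occurrence AS c
    JOIN visit_occurrence AS v USING (visit_occurrence_id)
    GROUP BY v.visit_concept_id
    ORDER BY v.visit_concept_id
""").fetchall()
print(rows)
```

Because the visit type is just a value in visit_concept_id, adding a new kind of encounter changes the data, not the query.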
Using hierarchies can drastically simplify queries: instead of listing hundreds or thousands of specific concepts, you can select a single ancestor concept_id and include all of its descendants via CONCEPT_ANCESTOR.
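A hedged sketch of such a roll-up query, again in an in-memory SQLite database. The ancestor_concept_id and descendant_concept_id columns follow the CONCEPT_ANCESTOR table; the hierarchy itself (100 as the parent of 110 and 120, 110 as the parent of 111), the person IDs, and the helper name persons_with are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_ancestor (
    ancestor_concept_id   INTEGER,
    descendant_concept_id INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id               INTEGER,
    condition_concept_id    INTEGER
);
-- Invented hierarchy: 100 -> {110, 120}, 110 -> {111}.
-- Each concept is also stored as its own ancestor, so querying by 100
-- matches records coded directly as 100 as well.
INSERT INTO concept_ancestor VALUES
    (100, 100), (100, 110), (100, 120), (100, 111),
    (110, 110), (110, 111), (120, 120), (111, 111);
INSERT INTO condition_occurrence VALUES
    (1, 10, 111), (2, 11, 120), (3, 12, 999), (4, 13, 100);
""")


def persons_with(ancestor_concept_id: int) -> list[int]:
    """People with a condition coded as the ancestor or any of its descendants."""
    rows = conn.execute(
        """
        SELECT DISTINCT co.person_id
        FROM condition_occurrence AS co
        JOIN concept_ancestor AS ca
          ON ca.descendant_concept_id = co.condition_concept_id
        WHERE ca.ancestor_concept_id = ?
        ORDER BY co.person_id
        """,
        (ancestor_concept_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(persons_with(100))  # the whole subtree under 100
print(persons_with(110))  # only 110 and its descendant 111
```

The query takes one concept_id as input; the list of specific descendant concepts never has to appear in the SQL.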