Usage considerations for DICOM data transformation in healthcare data solutions
This article outlines key considerations to review before using the DICOM data transformation capability. Understanding these factors ensures smooth integration and operation within your healthcare data solutions environment. This resource also helps you effectively navigate some potential challenges and limitations with the capability.
Ingestion file size
Currently, there's a logical size limit of 8 GB for ZIP files and up to 4 GB for native DCM files. If your files exceed these limits, you might experience longer execution times or failed ingestion. We recommend splitting the ZIP files into smaller segments (under 4 GB) to ensure successful execution. For native DCM files larger than 4 GB, make sure you scale up the Spark nodes (executors) as needed.
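As a minimal sketch of the splitting recommendation, the following Python helper groups files into segments whose combined size stays under a limit. The function name `plan_zip_segments` and the use of uncompressed file sizes as a rough proxy for the final ZIP size are assumptions made for illustration. They aren't part of the capability.

```python
from typing import Dict, List

GB = 1024 ** 3
ZIP_TARGET_BYTES = 4 * GB  # recommended upper bound per ZIP segment


def plan_zip_segments(file_sizes: Dict[str, int],
                      limit: int = ZIP_TARGET_BYTES) -> List[List[str]]:
    """Greedily group files so each group's total size stays under `limit`.

    `file_sizes` maps a file name to its size in bytes (hypothetical input).
    """
    segments: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Visit the largest files first, then fill each segment in that order.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size >= limit:
            raise ValueError(f"{name} alone reaches the segment limit")
        # Close the current segment if adding this file would reach the limit.
        if current and current_size + size >= limit:
            segments.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        segments.append(current)
    return segments
```

Each returned list can then be written to its own ZIP archive before you place it in the Ingest folder.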
DICOM tag extraction
We prioritize promoting the 29 tags present in the bronze lakehouse ImagingDicom delta table for the following two reasons:
- These 29 tags are crucial for general querying and exploration of DICOM data, such as modality and laterality.
- These tags are necessary for generating and populating the silver (FHIR) and gold (OMOP) delta tables in subsequent execution steps.
You can extend and promote other DICOM tags of your interest. However, the DICOM data transformation notebooks don't automatically recognize or process any other columns of DICOM tags that you add to the ImagingDicom delta table in the bronze lakehouse. You need to process the extra columns independently.
Append pattern in the bronze lakehouse
All newly ingested DCM (or ZIP) files are appended to the ImagingDicom delta table in the bronze lakehouse. For every successfully ingested DCM file, we create a new record entry in the ImagingDicom delta table. There's no business logic for merge or update operations at the bronze lakehouse level.
The ImagingDicom delta table reflects every ingested DCM file at the DICOM instance level and should be considered as such. If the same DCM file is ingested again into the Ingest folder, we add another entry to the ImagingDicom delta table for the same file. However, the file names are different due to the Unix prefix timestamp. Depending on the date of ingestion, the file might be placed within a different YYYY\MM\DD folder.
OMOP version and imaging extensions
The current implementation of the gold lakehouse is based on Observational Medical Outcomes Partnership (OMOP) Common Data Model version 5.4. OMOP doesn't yet have a normative extension to support imaging data. Hence, the capability implements the extension proposed in Development of Medical Imaging Data Standardization for Imaging-Based Observational Research: OMOP Common Data Model Extension. This extension is the most recent proposal in the imaging research field published on February 5, 2024. The current release of the DICOM data transformation capability is limited to the Image_Occurrence table in the gold lakehouse.
Structured streaming in Spark
Structured streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The system ensures end-to-end fault-tolerance guarantees through checkpoints and Write-Ahead logs. To learn more about structured streaming, see Structured Streaming Programming Guide (v3.5.1).
We use foreachBatch to process the incremental data. This method lets you apply arbitrary operations and custom write logic to the output of a streaming query. The logic runs on the output data of every micro-batch of the query. In the DICOM data transformation capability, structured streaming is used in the following execution steps:
Execution step | Checkpoint folder location | Tracked objects |
---|---|---|
Extract DICOM metadata into the bronze lakehouse | healthcare#.HealthDataManager\DMHCheckpoint\medical_imaging\dicom_metadata_extraction | DCM files in the bronze lakehouse under Files\Process\Imaging\DICOM\YYYY\MM\DD. |
Convert DICOM metadata to the FHIR format | healthcare#.HealthDataManager\DMHCheckpoint\medical_imaging\dicom_to_fhir | Delta table ImagingDicom in the bronze lakehouse. |
Ingest data into the bronze lakehouse ImagingStudy delta table | healthcare#.HealthDataManager\DMHCheckpoint\<bronzelakehouse>\ImagingStudy | FHIR NDJSON files in the bronze lakehouse under Files\Process\Clinical\FHIR NDJSON\YYYY\MM\DD\ImagingStudy. |
Ingest data into the silver lakehouse ImagingStudy delta table | healthcare#.HealthDataManager\DMHCheckpoint\<silverlakehouse>\ImagingStudy | Delta table ImagingStudy in the bronze lakehouse. |
Tip
You can use OneLake file explorer to view the content of the files and folders listed in the table. For more information, see Use OneLake file explorer.
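To illustrate the general pattern behind the execution steps in this section, here's a minimal PySpark sketch that reads a table as a stream, processes each micro-batch with foreachBatch, and records progress in a checkpoint folder. The helper names, the append-only write logic, and the table names you pass in are hypothetical. This sketch isn't the code used by the DICOM data transformation notebooks.

```python
def make_append_handler(target_table):
    """Build a foreachBatch handler that appends each micro-batch to a Delta table.

    `target_table` is a hypothetical table name chosen by the caller.
    """
    def handle_batch(batch_df, batch_id):
        # batch_df is a regular (non-streaming) DataFrame with only the new rows.
        batch_df.write.format("delta").mode("append").saveAsTable(target_table)
    return handle_batch


def start_incremental_stream(spark, source_table, checkpoint_path, handler):
    """Start a streaming query that calls `handler` once per micro-batch."""
    return (
        spark.readStream.table(source_table)            # stream new rows from the table
        .writeStream
        .foreachBatch(handler)                          # custom logic per micro-batch
        .option("checkpointLocation", checkpoint_path)  # progress survives restarts
        .trigger(availableNow=True)                     # process the backlog, then stop
        .start()
    )
```

Because the checkpoint folder tracks which input was already processed, rerunning the same query picks up only data that arrived after the previous run.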
Group pattern in the bronze lakehouse
Group patterns apply when you ingest new records from the ImagingDicom delta table in the bronze lakehouse to the ImagingStudy delta table in the bronze lakehouse. The DICOM data transformation capability groups all the instance-level records in the ImagingDicom delta table by the study level. It creates one record per DICOM study as an ImagingStudy, and then inserts the record into the ImagingStudy delta table in the bronze lakehouse.
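The grouping idea can be sketched in plain Python as follows. The field names (`study_instance_uid`, `series_instance_uid`) and the shape of the output record are assumptions for illustration. The actual ImagingStudy resource contains more detail.

```python
from collections import defaultdict


def group_instances_by_study(instances):
    """Collapse instance-level records into one summary record per study."""
    grouped = defaultdict(list)
    for instance in instances:
        grouped[instance["study_instance_uid"]].append(instance)

    studies = []
    for study_uid, members in grouped.items():
        studies.append({
            "study_instance_uid": study_uid,
            "number_of_instances": len(members),
            # Distinct series found among the study's instances.
            "series_uids": sorted({m["series_instance_uid"] for m in members}),
        })
    return studies
```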
Upsert pattern in the silver lakehouse
The upsert operation compares the FHIR delta tables between the bronze and silver lakehouses based on the {FHIRResource}.id:
- If a match is identified, the silver record is updated with the new bronze record.
- If there's no match identified, the bronze record is inserted as a new record in the silver lakehouse.
We use this pattern to create resources in the silver lakehouse ImagingStudy table.
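A minimal sketch of this match-then-update-or-insert logic, using an in-memory dictionary keyed by `id` in place of the silver delta table. The function name and record shape are hypothetical.

```python
def upsert_by_id(silver, bronze_records):
    """Return a new silver mapping after applying bronze records, plus counts."""
    merged = dict(silver)
    stats = {"updated": 0, "inserted": 0}
    for record in bronze_records:
        if record["id"] in merged:
            stats["updated"] += 1   # match found: bronze record replaces silver record
        else:
            stats["inserted"] += 1  # no match: bronze record becomes a new silver record
        merged[record["id"]] = record
    return merged, stats
```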
ImagingStudy limitations
The upsert operation works as expected when you ingest DCM files from the same DICOM study in the same batch execution. However, if you later ingest more DCM files (from a different batch) that belong to the same DICOM study previously ingested into the silver lakehouse, the ingestion results in an Insert operation. The process doesn't perform an Update operation.
This Insert operation occurs because the notebook creates a new {FHIRResource}.id for ImagingStudy in each batch execution. This new ID doesn't match with IDs in the previous batch. As a result, you see two ImagingStudy records in the silver table with different ImagingStudy.id values. These IDs are related to their respective batch executions but belong to the same DICOM study.
As a workaround, complete the batch executions and merge the two ImagingStudy records in the silver lakehouse based on a combination of unique IDs. However, don't use ImagingStudy.id for the merge. Instead, you can use other IDs such as studyInstanceUid (0020,000D) and patientId (0010,0020) to merge the records.
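One possible way to approach the workaround is sketched below in plain Python. It groups ImagingStudy records by the pair (study instance UID, patient ID) rather than by `id`. The field names and the choice to combine instance lists are assumptions for illustration. Adapt them to how your silver records are actually structured.

```python
def merge_split_studies(records):
    """Merge ImagingStudy records that share a study UID and patient ID."""
    merged = {}
    for record in records:
        key = (record["study_instance_uid"], record["patient_id"])
        if key not in merged:
            merged[key] = {
                "study_instance_uid": record["study_instance_uid"],
                "patient_id": record["patient_id"],
                "source_ids": [],      # per-batch ImagingStudy.id values, kept for traceability
                "instance_uids": [],
            }
        target = merged[key]
        target["source_ids"].append(record["id"])
        for uid in record["instance_uids"]:
            if uid not in target["instance_uids"]:  # skip instances seen in an earlier batch
                target["instance_uids"].append(uid)
    return list(merged.values())
```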
OMOP tracking approach
The healthcare#_msft_omop_silver_gold_transformation notebook uses the OMOP API to monitor changes in the silver lakehouse delta table. It identifies newly modified or added records that require upserting into the gold lakehouse delta tables. This process is known as Watermarking.
The OMOP API compares the date and time values between {Silver.FHIRDeltatable.modified_date} and {Gold.OMOPDeltatable.SourceModifiedOn} to determine the incremental records that were modified or added since the last notebook execution. However, this OMOP tracking approach doesn't apply to all delta tables in the gold lakehouse. The following tables aren't ingested from the delta table in the silver lakehouse:
- concept
- concept_ancestor
- concept_class
- concept_relationship
- concept_synonym
- fhir_system_to_omop_vocab_mapping
- vocabulary
These gold delta tables are populated using the vocabulary data included in the OMOP sample data deployment. The vocabulary dataset included in that deployment is managed using structured streaming, as described in Structured streaming in Spark.
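The watermark comparison described earlier in this section can be illustrated with a simplified Python sketch. It treats the latest `SourceModifiedOn` value in gold as the watermark and selects silver rows modified after it. The column names come from this article, but the single-watermark logic is an assumption. The OMOP API might implement the comparison differently.

```python
from datetime import datetime


def select_incremental(silver_rows, gold_rows):
    """Return silver rows modified after the newest SourceModifiedOn in gold."""
    # With an empty gold table, every silver row counts as new.
    watermark = max((g["SourceModifiedOn"] for g in gold_rows), default=datetime.min)
    return [s for s in silver_rows if s["modified_date"] > watermark]
```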
Upsert pattern in the gold lakehouse
The upsert pattern in the gold lakehouse is different from the silver lakehouse. The OMOP API used by the healthcare#_msft_omop_silver_gold_transformation notebook creates new IDs for each entry in the delta tables of the gold lakehouse. The API creates these IDs when it ingests or converts new records from the silver to gold lakehouse. The OMOP API also maintains internal mappings between the newly created IDs and their corresponding internal IDs in the silver lakehouse delta table.
The API works as follows:
- If a record is converted from a silver to a gold delta table for the first time, the API generates a new ID in the OMOP gold lakehouse and maps it to the original ID in the silver lakehouse. It then inserts the record into the gold delta table with the newly generated ID.
- If the same record in the silver lakehouse is modified and ingested again into the gold lakehouse, the OMOP API recognizes the existing record in the gold lakehouse (using the mapping information). It then updates the record in the gold lakehouse with the same ID that it generated before.
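The two cases above can be sketched with a small in-memory key mapper. The class name, the sequential integer IDs, and the dictionary standing in for a gold table are hypothetical. The sketch only illustrates how a stable mapping turns a repeated ingestion into an update instead of a second insert.

```python
class KeyMapper:
    """Maps original silver IDs (INTERNAL_ID) to generated gold IDs (ADRM_ID)."""

    def __init__(self):
        self._mapping = {}
        self._next_id = 1  # assumption: sequential integers, for illustration only

    def resolve(self, internal_id):
        """Return (adrm_id, is_new) for a silver ID, creating a mapping if needed."""
        if internal_id in self._mapping:
            return self._mapping[internal_id], False
        adrm_id = self._next_id
        self._next_id += 1
        self._mapping[internal_id] = adrm_id
        return adrm_id, True


def upsert_into_gold(gold, mapper, silver_record):
    """Insert or update one record in `gold`, keyed by its generated ID."""
    adrm_id, is_new = mapper.resolve(silver_record["id"])
    gold[adrm_id] = {**silver_record, "gold_id": adrm_id}
    return "insert" if is_new else "update"
```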
Mappings between the newly generated IDs (ADRM_ID) in the gold lakehouse and the original IDs (INTERNAL_ID) for each OMOP delta table are stored in OneLake parquet files. You can locate the parquet files at the following file path:
[OneLakePath]\[workspace]\healthcare#.HealthDataManager\DMHCheckpoint\dtt\dtt_state_db\KEY_MAPPING\[OMOPTableName]_ID_MAPPING
You can also query the parquet files in a Spark notebook to view the mapping.
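As a hedged sketch, the following helpers build the mapping path from its placeholders and load it with `spark.read.parquet`. The function names are hypothetical, and the use of forward slashes as separators is an assumption. Supply the actual OneLake path, workspace, and solution item name from your own environment.

```python
def key_mapping_path(onelake_path, workspace, solution_item, omop_table):
    """Assemble the KEY_MAPPING folder path for one OMOP table."""
    parts = [
        onelake_path.rstrip("/"),
        workspace,
        solution_item,  # your healthcare#.HealthDataManager item name
        "DMHCheckpoint", "dtt", "dtt_state_db", "KEY_MAPPING",
        f"{omop_table}_ID_MAPPING",
    ]
    return "/".join(parts)


def load_key_mapping(spark, path):
    """Read the mapping parquet files into a DataFrame (requires a Spark session)."""
    return spark.read.parquet(path)
```

In a notebook, you could then call `load_key_mapping(spark, path).show()` to inspect the ADRM_ID and INTERNAL_ID pairs.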
ImagingMetastore design in the silver lakehouse
A single DICOM file can contain up to 5,000 distinct tags, making it inefficient and resource-intensive to map and create fields for all these tags in the silver lakehouse. However, retaining access to the complete set of tags is essential to prevent data loss and maintain flexibility, especially if you require tags beyond the 29 extracted and represented in the data model. To address this problem, the silver lakehouse ImagingMetastore delta table stores all DICOM tags in the metadata_string column. These tags are represented as key-value pairs in a stringified JSON format, enabling efficient querying through the SQL analytics endpoint. This approach aligns with standard practices for managing complex JSON data across all fields in the silver lakehouse.
From the ImagingDicom table in the bronze lakehouse to the ImagingMetastore table in the silver lakehouse, the transformation doesn't perform any grouping. Resources are represented at the instance level in both tables. However, the {FHIRResource}.id is included in the ImagingMetastore table. This value allows you to query all instance-level artifacts associated with a specific study by referencing its unique ID.
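To show how a tag outside the promoted set might be read from metadata_string, here's a minimal Python sketch. It assumes that the column holds a flat JSON object keyed by tag name and that rows carry a study identifier in a field called `study_id`. Both are illustrative assumptions, so check the actual structure in your ImagingMetastore table.

```python
import json


def get_tag(metadata_string, tag_name, default=None):
    """Parse the stringified JSON and return the value for one tag."""
    tags = json.loads(metadata_string)
    return tags.get(tag_name, default)


def instances_for_study(rows, study_id):
    """Filter instance-level rows down to those that belong to one study."""
    return [row for row in rows if row["study_id"] == study_id]
```

For example, you could call `get_tag(row["metadata_string"], "Modality")` on each row returned by `instances_for_study`.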
Integration with DICOM service
The current integration between the DICOM data transformation capability and the Azure Health Data Services DICOM service supports only Create and Update events. You can create new imaging studies, series, and instances, or even update existing ones. However, the integration doesn't yet support Delete events. If you delete a study, series, or instance in the DICOM service, the DICOM data transformation capability doesn't reflect this change. The imaging data remains unchanged and isn't deleted.
Table warnings
Warnings appear for all tables in each lakehouse where one or more columns use complex object-oriented data types to represent data. In the ImagingDicom and ImagingMetastore tables, the metadata_string column uses a JSON structure to map DICOM tags as key-value pairs. This design accommodates the limitation of Fabric SQL endpoints, which don't support complex data types such as structs, arrays, and maps. You can query these columns as strings using the SQL endpoint (T-SQL) or work with their native types (structs, arrays, maps) using Spark.