DICOM metadata transformation mapping in healthcare data solutions
This article explains how the healthcare data solutions environment extracts and transforms DICOM metadata across different lakehouse levels. You can also learn about the end-to-end metadata transformation process and understand the transformation mapping at each level.
The metadata transformation through the ingestion pipeline consists of the following three consecutive stages:
- Extraction and transformation of DICOM metadata to bronze delta table
- Metadata transformation from bronze to silver delta table
- Metadata transformation from silver to gold delta table
The following sections detail the transformation mapping for each stage.
Transformation mapping for DICOM metadata to bronze delta table
There are more than 5,000 DICOM tags defined by the DICOM standard, in addition to vendor-specific private tags. This section identifies which tags are retrieved and explains the extraction process in the bronze lakehouse.
The tag extraction and ImagingDicom delta table creation process includes the following actions:
Extraction from DICOM files: Extract a collection of all the tags from the DICOM (DCM) files in the optimized folder structure in the bronze lakehouse.
Pixel data tag exclusion: Exclude the DICOM pixel data tag (7FE0,0010) and the image pixel data module attributes from the collection. The DICOM pixel data tag includes image/pixel-level details.
JSON mapping: Map all the extracted DICOM tags into a JSON structure of key-value pairs in the following schema:
METADATA_JSON_DICT_SCHEMA = MapType(
    StringType(),
    StructType([
        StructField("vr", StringType(), True),
        StructField("Value", ArrayType(StringType(), True), True)
    ])
)
These key-value JSON pairs are written to the metadata column in the bronze lakehouse ImagingDicom delta table.
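The pixel data exclusion and JSON mapping steps can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's implementation: the input format (a list of tag, VR, and values tuples), the `build_metadata` function, the eight-character key format, and the sample values are all assumptions. Only the pixel data tag itself is excluded here, not the other image pixel module attributes.

```python
import json

# DICOM pixel data tag (7FE0,0010), written without punctuation.
PIXEL_DATA_TAG = "7FE00010"


def build_metadata(elements):
    """Map (tag, vr, values) tuples to {tag: {"vr": ..., "Value": [...]}}.

    The result mirrors the shape of METADATA_JSON_DICT_SCHEMA: string keys,
    a string "vr", and an array of strings in "Value". Pixel data is skipped.
    """
    metadata = {}
    for tag, vr, values in elements:
        # Assumption: keys are the tag's hex digits without punctuation.
        key = tag.replace("(", "").replace(")", "").replace(",", "").upper()
        if key == PIXEL_DATA_TAG:
            continue
        metadata[key] = {"vr": vr, "Value": [str(v) for v in values]}
    return metadata


# Hypothetical sample elements; real values come from the DCM file.
sample = [
    ("(0008,0060)", "CS", ["CT"]),
    ("(0020,0013)", "IS", [7]),
    ("(7FE0,0010)", "OW", ["<binary pixel data>"]),
]

metadata = build_metadata(sample)
# String form, comparable to what a stringified metadata column would hold.
metadata_string = json.dumps(metadata, sort_keys=True)
print(metadata_string)
```

Converting every value to a string keeps the structure compatible with the `ArrayType(StringType())` field in the schema, at the cost of losing the original numeric types.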
Note
The metadata_string column also stores the metadata as a string because Fabric SQL endpoints don't support complex data types such as structs, arrays, and maps. You can query these columns as strings using the SQL endpoint (T-SQL) or work with their native types (structs, arrays, maps) using Spark.
Extraction and mapping to bronze lakehouse: Further extract the following 29 DICOM tags and write them to the respective destination columns in the ImagingDicom delta table:
Source DICOM tag | Destination column | Required |
---|---|---|
(0020,000D) | studyInstanceUid | Yes |
(0010,0010) | patientName | No |
(0010,0040) | patientSex | No |
(0010,0020) | patientId | Yes |
(0010,0030) | patientBirthDate | No |
(0008,0050) | accessionNumber | Yes |
(0008,0090) | referringPhysicianName | Yes |
(0008,0020) | studyDate | Yes |
(0008,1030) | studyDescription | Yes |
(0020,000E) | seriesInstanceUid | Yes |
(0008,0060) | modality | Yes |
(0008,0061) | modalitiesInStudy | Yes |
(0040,0244) | performedProcedureStepStartDate | No |
(0008,1090) | manufacturerModelName | No |
(0008,0018) | sopInstanceUid | Yes |
(0008,0030) | studyTime | Yes |
(0008,0201) | timezoneOffsetFromUtc | Yes |
(0020,1206) | numberOfStudyRelatedSeries | Yes |
(0020,1208) | numberOfStudyRelatedInstances | Yes |
(0020,0011) | seriesNumber | Yes |
(0008,103E) | seriesDescription | Yes |
(0020,1209) | numberOfSeriesRelatedInstances | Yes |
(0018,0015) | bodyPartExamined | Yes |
(0020,0060) | laterality | Yes |
(0008,0021) | seriesDate | Yes |
(0008,0031) | seriesTime | Yes |
(0008,0016) | sopClassUid | Yes |
(0020,0013) | instanceNumber | Yes |
(0042,0010) | documentTitle | Yes |
Note
For more information about why we promote these particular 29 DICOM tags, see DICOM tag extraction.
To learn more about the ingestion pattern (append), go to Append pattern in the bronze lakehouse.
The modalitiesInStudy_string column also stores the modalitiesInStudy tag as a string because Fabric SQL endpoints don't support complex data types such as structs, arrays, and maps. You can query these columns as strings using the SQL endpoint (T-SQL) or work with their native types (structs, arrays, maps) using Spark.
DCM file path storage: The full file path for the DCM file is written to the filePath column in the ImagingDicom delta table.
Modification time logging: The latest timestamp at which the DCM file was modified at its source is written to the sourceModifiedAt column in the ImagingDicom delta table.
Namespace storage: The namespace value is written to the sourceSystem column in the ImagingDicom delta table. This value derives from the folder name in the unified folder structure.
- For regular ingestion, the namespace value is the folder name after Files\Process\Imaging\DICOM.
- For Bring Your Own Storage (BYOS) ingestion, the namespace value is the folder name after Files\External\Imaging\DICOM.
Execution time logging: The notebook's execution date and time are written to the createdDatetime column in the ImagingDicom delta table.
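The namespace rule described above (the folder name that follows the DICOM root folder) can be illustrated with a short sketch. The function name, the handling of both slash styles, and the sample paths are assumptions made for illustration; the actual notebook logic isn't shown in this article.

```python
from typing import Optional

# Root folders named in the text; the namespace is the next path segment.
NAMESPACE_ROOTS = (
    ("Files", "Process", "Imaging", "DICOM"),   # regular ingestion
    ("Files", "External", "Imaging", "DICOM"),  # BYOS ingestion
)


def namespace_from_path(file_path: str) -> Optional[str]:
    """Return the path segment that follows a known DICOM root, if any."""
    # Accept both backslash and forward-slash separators.
    parts = [p for p in file_path.replace("\\", "/").split("/") if p]
    for root in NAMESPACE_ROOTS:
        n = len(root)
        for i in range(len(parts) - n):
            if tuple(parts[i:i + n]) == root:
                return parts[i + n]
    return None


# Hypothetical paths for illustration only.
regular = r"Files\Process\Imaging\DICOM\MyPACSsystem\2024\01\15\scan.dcm"
byos = "Files/External/Imaging/DICOM/SiteA/study1/scan.dcm"

print(namespace_from_path(regular))  # MyPACSsystem
print(namespace_from_path(byos))     # SiteA
```

A path that doesn't contain either root returns `None`, which a caller would need to handle before writing the sourceSystem value.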
Transformation mapping for bronze to silver delta table
The following tables explain the complete mapping for the transformation of DICOM metadata from the bronze lakehouse ImagingDicom delta table to the ImagingMetastore and ImagingStudy delta tables in the silver lakehouse. The ImagingMetastore delta table stores the DICOM tags for each DCM file as JSON key-value pairs within the metadata columns. Copying all the metadata from the bronze to the silver layer preserves data integrity across layers. The ImagingStudy delta table includes the 29 DICOM tags selected for alignment with FHIR standard fields. It also contains additional fields to support data tracking and lineage.
Source column in ImagingDicom | Destination column in ImagingMetastore | Mapping details |
---|---|---|
NA | msftModifiedDatetime | Included through the common delta merge logic applied to all tables in the silver layer. |
studyInstanceUid | studyInstanceUid | Direct mapping with a one-to-one relationship. Each value in the source column maps directly to a single corresponding value in the destination. |
seriesInstanceUid | seriesInstanceUid | Direct mapping with a one-to-one relationship. |
sopInstanceUid | sopInstanceUid | Direct mapping with a one-to-one relationship. |
sourceSystem | msftSourceSystem | Direct mapping with a one-to-one relationship. |
metadata | metadata | Direct mapping with a one-to-one relationship. |
metadata_string | metadata_string | Direct mapping with a one-to-one relationship. |
filePath | filePath | Direct mapping with a one-to-one relationship. |
sourceModifiedAt | sourceModifiedAt | Direct mapping with a one-to-one relationship. |
NA | id | A GUID generated using the Python UUID module. |
NA | msftCreatedDatetime | Included through the common delta merge logic applied to all tables in the silver layer. |
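As a rough illustration of the ImagingMetastore mapping, the following sketch copies the directly mapped columns from a bronze row, renames sourceSystem to msftSourceSystem, and generates an id. The `to_metastore_row` function and the sample row are hypothetical, and `uuid.uuid4()` is an assumption: the table only states that the Python UUID module is used. The msftCreatedDatetime and msftModifiedDatetime columns come from the common delta merge logic and aren't modeled here.

```python
import uuid

# Only one listed column changes its name between bronze and silver.
RENAMES = {"sourceSystem": "msftSourceSystem"}
DIRECT_COLUMNS = (
    "studyInstanceUid", "seriesInstanceUid", "sopInstanceUid", "sourceSystem",
    "metadata", "metadata_string", "filePath", "sourceModifiedAt",
)


def to_metastore_row(bronze_row: dict) -> dict:
    """Build a silver-style row from a bronze ImagingDicom row (as a dict)."""
    row = {RENAMES.get(c, c): bronze_row.get(c) for c in DIRECT_COLUMNS}
    # Assumption: a random (version 4) UUID serves as the GUID.
    row["id"] = str(uuid.uuid4())
    return row


# Hypothetical bronze row with only a few columns populated.
bronze_row = {
    "studyInstanceUid": "1.2.840.1",
    "seriesInstanceUid": "1.2.840.1.1",
    "sopInstanceUid": "1.2.840.1.1.1",
    "sourceSystem": "MyPACSsystem",
    "filePath": "Files/Process/Imaging/DICOM/MyPACSsystem/scan.dcm",
}

silver_row = to_metastore_row(bronze_row)
print(silver_row["msftSourceSystem"], silver_row["id"])
```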
Source column in ImagingDicom | Destination column in ImagingStudy | Mapping details |
---|---|---|
NA | msftModifiedDatetime | Included through the common delta merge logic applied to all tables in the silver layer. |
NA | id | A GUID generated using the Python UUID module. |
NA | resourceType | "ImagingStudy" |
sourceSystem | msftSourceSystem | Not a direct mapping. The DICOM data transformation capability uses the sourceSystem column in the bronze lakehouse to create the Namespace folder when writing the generated NDJSON files to the Process folder. To learn more about the Namespace folder, see Unified folder structure: Folder descriptions. At this stage, the clinical bronze ingestion service uses the Namespace folder name to populate the msftSourceSystem column in the silver lakehouse. For example, if the sourceSystem value is defined as MyPACSsystem in the bronze ImagingDicom table, the imaging bronze ingestion service writes the newly created NDJSON files to the following folder structure: Process\Clinical\FHIR-NDJSON\MyPACSsystem\YYYY\MM\DD\ImagingStudy-<timestamp>.ndjson. When the clinical bronze ingestion service picks up these files, it automatically populates the msftSourceSystem column with MyPACSsystem from the folder structure and propagates the same value to the silver layer. |
NA | msftFilePath | File path to the generated ImagingStudy NDJSON in the Process\Clinical\FHIR-NDJSON\DICOM-HDS folder. |
filePath | extension | "extension": [{"url": "lit('file_path')", "valueUrl": "col('FilePath')"}] The value for FilePath includes the ABFS file path in OneLake for all instance-level DCM files that are part of this ImagingStudy. |
NA | meta | "meta": {"lastUpdated": "current_timestamp()"} |
studyInstanceUid, accessionNumber | identifier | ImagingStudy.identifier.where(system = 'urn:dicom:uid') => StudyInstanceUID; ImagingStudy.identifier.where(type.coding.system = 'http://terminology.hl7.org/CodeSystem/v2-0203' and type.coding.code = 'ACSN') => AccessionNumber |
NA | status | "available" |
modalitiesInStudy | modality | modality = List{code = col('ModalitiesInStudy')} |
patientId | subject | "subject": {"identifier": {"type": {"coding": [{"system": "lit('http://terminology.hl7.org/CodeSystem/v2-0203')", "code": "lit('MR')"}]}, "value": "col('PatientID')"}, "type": "lit('Patient')"} |
patientName, patientBirthDate, patientSex | subject | "subject": {"extension": [{"url": "lit('name')", "valueString": "col('PatientName')"}, {"url": "lit('birthDate')", "valueDateTime": "col('PatientBirthDate')"}, {"url": "lit('gender')", "valueCode": "col('PatientSex')"}]} |
studyDate, studyTime, timezoneOffsetFromUtc | started | concat_ws(' ', col('StudyDate'), col('StudyTime'), col('TimezoneOffsetFromUTC')) |
numberOfStudyRelatedSeries | numberOfSeries | col('NumberOfStudyRelatedSeries') |
numberOfStudyRelatedInstances | numberOfInstances | col('NumberOfStudyRelatedInstances') |
studyDescription | description | col('StudyDescription') |
seriesInstanceUid, seriesDate, seriesTime, timezoneOffsetFromUtc, modality, laterality, bodyPartExamined, numberOfSeriesRelatedInstances, seriesDescription, seriesNumber, sopInstanceUid, sopClassUid, instanceNumber, documentTitle | series | {"series": [{"uid": "col('SeriesInstanceUID')", "started": {"tag": "SeriesDate,SeriesTime,TimezoneOffsetFromUTC", "calc": "concat_ws(' ', col('SeriesDate'), col('SeriesTime'), col('TimezoneOffsetFromUTC')).cast(TimestampType())"}, "modality": {"code": "col('Modality')", "system": "lit('https://dicom.nema.org/resources/ontology/DCM')"}, "laterality": {"display": "col('Laterality')"}, "bodySite": {"display": "col('BodyPartExamined')"}, "numberOfInstances": "col('NumberOfSeriesRelatedInstances')", "description": "col('SeriesDescription')", "number": "col('SeriesNumber')", "instance": [{"uid": "col('SOPInstanceUID')", "sopClass": {"code": "col('SOPClassUID')"}, "number": "col('InstanceNumber')", "title": "col('DocumentTitle')", "extension": [{"url": "lit('file_path')", "valueUrl": "col('FilePath')"}]}]}]} |
NA | meta.lastupdated | current_timestamp() |
NA | msftCreatedDatetime | Included through the common delta merge logic applied to all tables in the silver layer. |
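Two of the ImagingStudy mappings, started and identifier, can be sketched in plain Python to show the resulting shapes. This is an illustration under assumptions: the helper names and sample values are hypothetical, the `concat_ws` helper only imitates the Spark function's behavior of skipping null values, and the identifier values are passed through unchanged because the article doesn't describe any additional formatting.

```python
def concat_ws(sep, *values):
    """Join values with sep, skipping None, as Spark's concat_ws does."""
    return sep.join(str(v) for v in values if v is not None)


def build_identifier(study_instance_uid, accession_number):
    """Build a FHIR-style identifier list from the two bronze columns."""
    return [
        # Matches identifier.where(system = 'urn:dicom:uid').
        {"system": "urn:dicom:uid", "value": study_instance_uid},
        # Matches the entry whose type coding has code 'ACSN'.
        {
            "type": {"coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
                "code": "ACSN",
            }]},
            "value": accession_number,
        },
    ]


# Hypothetical sample values.
started = concat_ws(" ", "20240115", "093000", "+0100")
started_without_offset = concat_ws(" ", "20240115", "093000", None)
identifier = build_identifier("1.2.840.1", "ACC-001")

print(started)                 # 20240115 093000 +0100
print(started_without_offset)  # 20240115 093000
```

Because null values are skipped rather than rendered as empty strings, a missing timezone offset shortens the started value instead of leaving a trailing separator.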
Note
Columns with the suffix Orig are created in the silver lakehouse to store original values of fields sourced from the bronze layer. This standard practice includes the following columns in the ImagingStudy table: meta_lastUpdatedOrig, identifierOrig, idOrig, and startedOrig.
Columns with the _string suffix store stringified versions of fields containing complex JSON data, enabling querying through the SQL analytics endpoint. This practice applies across all tables in the silver lakehouse and includes the following columns in the ImagingStudy table: meta_string, text_string, contained_string, identifier_string, modality_string, subject_string, encounter_string, basedOn_string, referrer_string, interpreter_string, endpoint_string, procedureReference_string, procedureCode_string, location_string, reasonCode_string, reasonReference_string, note_string, series_string, and identifierOrig_string.
Some fields in the ImagingStudy table are generated to align with the FHIR ImagingStudy schema. However, since the bronze layer doesn't extract data from the DCM files that accurately corresponds to these fields, the related columns in the silver table remain empty. As a result, the following columns in the ImagingStudy table contain null values: implicitRules, language, text, contained, encounter, basedOn, referrer, interpreter, endpoint, procedureReference, procedureCode, location, reasonCode, reasonReference, and note.
Transformation mapping for silver to gold delta table
The following table explains the complete mapping for the transformation of DICOM data in the silver lakehouse ImagingStudy delta table to the Observational Medical Outcomes Partnership (OMOP) Image_Occurrence delta table in the gold lakehouse.
Source column in ImagingStudy | Destination column in OMOP Image_Occurrence | Data type | Mapping details |
---|---|---|---|
series.started | image_occurrence_date | date | Imaging procedure (series) occurrence date. |
series.modality (combination of series.modality.code and series.modality.system) | modality_concept_id | string | concat_ws('<->', exp_series.modality.code, exp_series.modality.system) |
NA | SourceTable | string | 'ImagingStudy_FHIR' |
id | msftSourceRecordId | string | System generated ID of the source record. |
identifier['studyInstanceUid'] | image_study_uid | string | DICOM Study UID. |
subject | person_id | integer | Person ID of the person associated with the recorded procedure. |
An array of dictionary values, where the key is instance.uid and the value is instance.extension[0].valueUrl | local_path | string | to_json(transform(exp_series.instance, x -> map('instanceid', x.uid, 'local_path', from_json(x.extension, 'array<struct<valueUrl:string,url:string>>')[0].valueUrl))) |
NA | SourceModifiedOn | datetime | Record modification date. |
resourceType | msftSourceTableName | string | 'Imaging Study' |
msftModifiedDatetime | msftModifiedDatetime | datetime | Direct mapping with a one-to-one relationship. |
series.uid | image_occurrence_id | string | Unique key given to an imaging study record. |
series.modality.code | modality_source_value | string | Modality of the series. |
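The modality_concept_id and local_path expressions in the table can be mirrored in plain Python to show what the resulting strings look like. This is a sketch under assumptions: the function names and the sample series are hypothetical, both modality fields are assumed to be present, and the extension field is treated as a JSON string because the table's expression parses it with from_json.

```python
import json


def modality_concept_id(series: dict) -> str:
    """Join modality code and system with '<->', like the concat_ws call."""
    modality = series["modality"]
    return "<->".join([modality["code"], modality["system"]])


def local_path(series: dict) -> str:
    """Return a JSON array pairing each instance uid with its file path."""
    entries = []
    for instance in series["instance"]:
        # extension is assumed to hold a JSON string of url/valueUrl structs.
        extension = json.loads(instance["extension"])
        entries.append({
            "instanceid": instance["uid"],
            "local_path": extension[0]["valueUrl"],
        })
    return json.dumps(entries, separators=(",", ":"))


# Hypothetical exploded series record.
series = {
    "uid": "1.2.840.1.1",
    "modality": {
        "code": "CT",
        "system": "https://dicom.nema.org/resources/ontology/DCM",
    },
    "instance": [{
        "uid": "1.2.840.1.1.1",
        "extension": '[{"url": "file_path", "valueUrl": "abfss://example/scan.dcm"}]',
    }],
}

print(modality_concept_id(series))
print(local_path(series))
```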
Note
Some fields in the gold table are generated to align with the OMOP Image_Occurrence schema. However, since the bronze layer doesn't extract data that accurately corresponds to these fields, the related columns in the gold table remain empty. As a result, the following columns in the Image_Occurrence table contain null values: visit_occurrence_id, procedure_occurrence_id, and anatomic_site_concept_id.