DICOM metadata transformation mapping in healthcare data solutions

11/30/2024

This article explains how the healthcare data solutions environment extracts and transforms DICOM metadata across different lakehouse levels. You can also learn about the end-to-end metadata transformation process and understand the transformation mapping at each level.

The metadata transformation through the ingestion pipeline consists of the following three consecutive stages:

Extraction and transformation of DICOM metadata to bronze delta table
Metadata transformation from bronze to silver delta table
Metadata transformation from silver to gold delta table

The following sections detail the transformation mapping for each stage.

Transformation mapping for DICOM metadata to bronze delta table

The DICOM standard defines more than 5000 tags, and vendors can add their own private tags. This section identifies which tags are retrieved and explains the extraction process in the bronze lakehouse.

The tag extraction and ImagingDicom delta table creation process includes the following actions:

Extraction from DICOM files: Extract a collection of all the tags from the DICOM (DCM) files in the optimized folder structure in the bronze lakehouse.
Pixel data tag exclusion: Exclude the DICOM pixel data tag (7FE0,0010) and the image pixel data module attributes from the collection. The DICOM pixel data tag includes image/pixel-level details.
JSON mapping: Map all the extracted DICOM tags into a JSON structure of key-value pairs in the following schema:
```
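from pyspark.sql.types import ArrayType, MapType, StringType, StructField, StructType

# Each DICOM tag key maps to its value representation (vr) and a list of string values.
METADATA_JSON_DICT_SCHEMA = MapType(
    StringType(),
    StructType([
        StructField("vr", StringType(), True),
        StructField("Value", ArrayType(StringType(), True), True),
    ]),
)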
```
These key-value JSON pairs are written to the metadata column in the bronze lakehouse ImagingDicom delta table.

Note

The metadata_string column also stores the metadata as a string because Fabric SQL endpoints don't support complex data types such as structs, arrays, and maps. You can query these columns as strings using the SQL endpoint (T-SQL) or work with their native types (structs, arrays, maps) using Spark.
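
The exclusion and JSON mapping steps can be sketched in plain Python. This is a minimal illustration, not the solution's notebook code. It assumes the tags are already available as a dictionary in the DICOM JSON model (tag keys such as `0020000D`, each with a `vr` and an optional `Value` list), and the names `EXCLUDED_TAGS`, `build_metadata`, and `to_metadata_string` are hypothetical.

```python
import json

# Pixel Data tag (7FE0,0010) written as a DICOM JSON model key.
# Assumption: any other excluded attributes would be added to this set.
EXCLUDED_TAGS = {"7FE00010"}


def build_metadata(dicom_json):
    """Return a tag -> {"vr", "Value"} map shaped like METADATA_JSON_DICT_SCHEMA."""
    metadata = {}
    for tag, element in dicom_json.items():
        if tag.upper() in EXCLUDED_TAGS:
            continue  # skip image/pixel-level data
        values = element.get("Value")
        metadata[tag] = {
            "vr": element.get("vr"),
            # The schema stores every value as a string, so stringify each item.
            "Value": None if values is None else [
                v if isinstance(v, str) else json.dumps(v) for v in values
            ],
        }
    return metadata


def to_metadata_string(metadata):
    """Serialize the map for a string column such as metadata_string."""
    return json.dumps(metadata, sort_keys=True)


sample = {
    "0020000D": {"vr": "UI", "Value": ["1.2.840.1"]},
    "00200013": {"vr": "IS", "Value": [7]},
    "7FE00010": {"vr": "OW", "InlineBinary": "AAAA"},
}
metadata = build_metadata(sample)
metadata_string = to_metadata_string(metadata)
```

Keeping both the structured map and its string form mirrors the `metadata` and `metadata_string` columns: Spark can read the first natively, while the SQL endpoint reads the second as text.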

Extraction and mapping to bronze lakehouse: Further extract the following 29 DICOM tags and write them to the respective destination columns in the ImagingDicom delta table:

Source DICOM tag	Destination column	Required
(0020,000D)	`[studyInstanceUid]`	Yes
(0010,0010)	`[patientName]`	No
(0010,0040)	`[patientSex]`	No
(0010,0020)	`[patientId]`	Yes
(0010,0030)	`[patientBirthDate]`	No
(0008,0050)	`[accessionNumber]`	Yes
(0008,0090)	`[referringPhysicianName]`	Yes
(0008,0020)	`[studyDate]`	Yes
(0008,1030)	`[studyDescription]`	Yes
(0020,000E)	`[seriesInstanceUid]`	Yes
(0008,0060)	`[modality]`	Yes
(0008,0061)	`[modalitiesInStudy]`	Yes
(0040,0244)	`[performedProcedureStepStartDate]`	No
(0008,1090)	`[manufacturerModelName]`	No
(0008,0018)	`[sopInstanceUid]`	Yes
(0008,0030)	`[studyTime]`	Yes
(0008,0201)	`[timezoneOffsetFromUtc]`	Yes
(0020,1206)	`[numberOfStudyRelatedSeries]`	Yes
(0020,1208)	`[numberOfStudyRelatedInstances]`	Yes
(0020,0011)	`[seriesNumber]`	Yes
(0008,103E)	`[seriesDescription]`	Yes
(0020,1209)	`[numberOfSeriesRelatedInstances]`	Yes
(0018,0015)	`[bodyPartExamined]`	Yes
(0020,0060)	`[laterality]`	Yes
(0008,0021)	`[seriesDate]`	Yes
(0008,0031)	`[seriesTime]`	Yes
(0008,0016)	`[sopClassUid]`	Yes
(0020,0013)	`[instanceNumber]`	Yes
(0042,0010)	`[documentTitle]`	Yes

Note

For more information about why we promote these particular 29 DICOM tags, see DICOM tag extraction.
To learn more about the ingestion pattern (append), go to Append pattern in the bronze lakehouse.
The modalitiesInStudy_string column also stores the modalitiesInStudy tag as a string because Fabric SQL endpoints don't support complex data types such as structs, arrays, and maps. You can query these columns as strings using the SQL endpoint (T-SQL) or work with their native types (structs, arrays, maps) using Spark.
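
Promoting selected tags to dedicated columns can be illustrated with a small lookup. This sketch is an example under stated assumptions rather than the shipped implementation: it reuses three rows from the table above, assumes tag keys in the DICOM JSON model form (for example `0020000D`), and the names `PROMOTED_TAGS`, `MULTI_VALUED_COLUMNS`, and `promote_tags` are hypothetical. How the pipeline treats an absent tag isn't described here, so the sketch simply leaves the column empty.

```python
# Subset of the mapping table: tag key -> destination column.
PROMOTED_TAGS = {
    "0020000D": "studyInstanceUid",
    "00100020": "patientId",
    "00080061": "modalitiesInStudy",
}

# Assumption: multi-valued tags stay as lists; all others take the first value.
MULTI_VALUED_COLUMNS = {"modalitiesInStudy"}


def promote_tags(metadata):
    """Copy selected tag values from the metadata map into named columns."""
    row = {}
    for tag, column in PROMOTED_TAGS.items():
        values = (metadata.get(tag) or {}).get("Value")
        if not values:
            row[column] = None  # tag absent or empty
        elif column in MULTI_VALUED_COLUMNS:
            row[column] = list(values)
        else:
            row[column] = values[0]
    return row


row = promote_tags({
    "0020000D": {"vr": "UI", "Value": ["1.2.840.1"]},
    "00080061": {"vr": "CS", "Value": ["CT", "MR"]},
})
```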

DCM file path storage: The full file path for the DCM file is written to the filePath column in the ImagingDicom delta table.
Modification time logging: The latest timestamp at which the DCM file was modified at its source is written to the sourceModifiedAt column in the ImagingDicom delta table.
Namespace storage: The namespace value is written to the sourceSystem column in the ImagingDicom delta table. This value derives from the folder name in the unified folder structure.
- For regular ingestion, the namespace value is the folder name after Files\Process\Imaging\DICOM.
- For Bring Your Own Storage (BYOS) ingestion, the namespace value is the folder name after Files\External\Imaging\DICOM.
Execution time logging: The notebook's execution date and time are written to the createdDatetime column in the ImagingDicom delta table.
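
The namespace rule can be sketched as a path lookup. This is an illustrative example only: the names `NAMESPACE_ROOTS` and `namespace_from_path` are hypothetical, and the sample paths are invented to show the two folder roots named above.

```python
# Folder roots named in the namespace rules, as path segments.
NAMESPACE_ROOTS = (
    ("Files", "Process", "Imaging", "DICOM"),
    ("Files", "External", "Imaging", "DICOM"),
)


def namespace_from_path(file_path):
    """Return the folder name that follows a known root, or None if absent."""
    # Accept both backslash and forward-slash separators.
    parts = [p for p in file_path.replace("\\", "/").split("/") if p]
    for root in NAMESPACE_ROOTS:
        n = len(root)
        # Require at least one more segment after the namespace folder,
        # so a file name directly under the root is never returned.
        for i in range(len(parts) - n - 1):
            if tuple(parts[i:i + n]) == root:
                return parts[i + n]
    return None


regular = namespace_from_path(r"Files\Process\Imaging\DICOM\MyPACSsystem\2024\11\30\a.dcm")
byos = namespace_from_path("Files/External/Imaging/DICOM/SiteB/scan.dcm")
```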

Transformation mapping for bronze to silver delta table

The following tables explain the complete mapping for the transformation of DICOM metadata from the bronze lakehouse ImagingDicom delta table to the ImagingMetastore and ImagingStudy delta tables in the silver lakehouse. The ImagingMetastore delta table stores the DICOM tags for each DCM file as JSON key-value pairs within the metadata columns. Copying all the metadata from the bronze to the silver layer preserves data integrity across layers. The ImagingStudy delta table includes the 29 DICOM tags selected for alignment with FHIR standard fields. It also contains more fields to support data tracking and lineage.

Source column in ImagingDicom	Destination column in ImagingMetastore	Mapping details
NA	`msftModifiedDatetime`	Included through the common delta merge logic applied to all tables in the silver layer.
`studyInstanceUid`	`studyInstanceUid`	Direct mapping with a one-to-one relationship. Each value in the source column maps directly to a single corresponding value in the destination.
`seriesInstanceUid`	`seriesInstanceUid`	Direct mapping with a one-to-one relationship.
`sopInstanceUid`	`sopInstanceUid`	Direct mapping with a one-to-one relationship.
`sourceSystem`	`msftSourceSystem`	Direct mapping with a one-to-one relationship.
`metadata`	`metadata`	Direct mapping with a one-to-one relationship.
`metadata_string`	`metadata_string`	Direct mapping with a one-to-one relationship.
`filePath`	`filePath`	Direct mapping with a one-to-one relationship.
`sourceModifiedAt`	`sourceModifiedAt`	Direct mapping with a one-to-one relationship.
NA	`id`	A GUID generated using the Python UUID module.
NA	`msftCreatedDatetime`	Included through the common delta merge logic applied to all tables in the silver layer.

Source column in ImagingDicom	Destination column in ImagingStudy	Mapping details
NA	`msftModifiedDatetime`	Included through the common delta merge logic applied to all tables in the silver layer.
NA	`id`	A GUID generated using the Python UUID module.
NA	`resourceType`	`"ImagingStudy"`
`sourceSystem`	`msftSourceSystem`	Not a direct mapping. The DICOM data transformation capability uses the `sourceSystem` column in the bronze lakehouse to create the Namespace folder when writing the generated NDJSON files to the Process folder. To learn more about the Namespace folder, see Unified folder structure: Folder descriptions. At this stage, the clinical bronze ingestion service uses the Namespace folder name to populate the `msftSourceSystem` column in the silver lakehouse. For example, if the `sourceSystem` value is defined as `MyPACSsystem` in the bronze ImagingDicom table, the imaging bronze ingestion service writes the newly created NDJSON files to the following folder structure: `Process\Clinical\FHIR-NDJSON\MyPACSsystem\YYYY\MM\DD\ImagingStudy-<timestamp>.ndjson`. When the clinical bronze ingestion service picks up these files, it automatically populates the `msftSourceSystem` column with `MyPACSsystem` from the folder structure and propagates the same value to the silver layer.
NA	`msftFilePath`	File path to the generated ImagingStudy NDJSON in the `Process\Clinical\FHIR-NDJSON\DICOM-HDS` folder.
`filePath`	`extension`	`"extension": [{"url": "lit('file_path')", "valueUrl": "col('FilePath')"}]` The value for `FilePath` includes the ABFS file path in OneLake for all instance-level DCM files that are part of this ImagingStudy.
NA	`meta`	`"meta": {"lastUpdated":"current_timestamp()"}`
`studyInstanceUid` `accessionNumber`	`identifier`	`ImagingStudy.identifier.where(system = 'urn:dicom:uid')` => `StudyInstanceUID`; `ImagingStudy.identifier.where(type.coding.system = 'http://terminology.hl7.org/CodeSystem/v2-0203' and type.coding.code = 'ACSN')` => `AccessionNumber`
NA	`status`	`"available"`
`modalitiesInStudy`	`modality`	`modality = List{code = col('ModalitiesInStudy')}`
`patientId`	`subject`	`"subject": {"identifier": {"type": {"coding": [{"system": "lit('http://terminology.hl7.org/CodeSystem/v2-0203')", "code": "lit('MR')"}]}, "value": "col('PatientID')"}, "type": "lit('Patient')"}`
`patientName` `patientBirthDate` `patientSex`	`subject`	`"subject": {"extension": [{"url": "lit('name')", "valueString": "col('PatientName')"}, {"url": "lit('birthDate')", "valueDateTime": "col('PatientBirthDate')"}, {"url": "lit('gender')", "valueCode": "col('PatientSex')"}]}`
`studyDate` `studyTime` `timezoneOffsetFromUtc`	`started`	`concat_ws(' ', col('StudyDate'), col('StudyTime'), col('TimezoneOffsetFromUTC'))`
`numberOfStudyRelatedSeries`	`numberOfSeries`	`col('NumberOfStudyRelatedSeries')`
`numberOfStudyRelatedInstances`	`numberOfInstances`	`col('NumberOfStudyRelatedInstances')`
`studyDescription`	`description`	`col('StudyDescription')`
`seriesInstanceUid` `seriesDate` `seriesTime` `timezoneOffsetFromUtc`  `modality` `laterality` `bodyPartExamined`  `numberOfSeriesRelatedInstances` `seriesDescription` `seriesNumber` `sopInstanceUid`  `sopClassUid`  `instanceNumber`  `documentTitle`	`series`	{"series": [{"uid": "col('SeriesInstanceUID')", "started": {"tag": "SeriesDate,SeriesTime,TimezoneOffsetFromUTC", "calc": "concat_ws(' ', col('SeriesDate'), col('SeriesTime'), col('TimezoneOffsetFromUTC')).cast(TimestampType())"}, "modality": {"code": "col('Modality')", "system": "lit('https://dicom.nema.org/resources/ontology/DCM')"}, "laterality": {"display": "col('Laterality')"}, "bodySite": {"display": "col('BodyPartExamined')"}, "numberOfInstances": "col('NumberOfSeriesRelatedInstances')", "description": "col('SeriesDescription')", "number": "col('SeriesNumber')", "instance": [{"uid": "col('SOPInstanceUID')", "sopClass": {"code": "col('SOPClassUID')"}, "number": "col('InstanceNumber')", "title": "col('DocumentTitle')", "extension": [{"url": "lit('file_path')", "valueUrl": "col('FilePath')"}]}]}]}
NA	`meta.lastupdated`	`Currenttimestamp()`
NA	`msftCreatedDatetime`	Included through the common delta merge logic applied to all tables in the silver layer.

Note

Columns with the suffix Orig are created in the silver lakehouse to store original values of fields sourced from the bronze layer. This standard practice includes the following columns in the ImagingStudy table: meta_lastUpdatedOrig, identifierOrig, idOrig, and startedOrig.
Columns with the _string suffix store stringified versions of fields containing complex JSON data, enabling querying through the SQL analytics endpoint. This practice applies across all tables in the silver lakehouse and includes the following columns in the ImagingStudy table: meta_string, text_string, contained_string, identifier_string, modality_string, subject_string, encounter_string, basedOn_string, referrer_string, interpreter_string, endpoint_string, procedureReference_string, procedureCode_string, location_string, reasonCode_string, reasonReference_string, note_string, series_string, and identifierOrig_string.
Some fields in the ImagingStudy table are generated to align with the FHIR ImagingStudy schema. However, since the bronze layer doesn't extract data from the DCM files that accurately corresponds to these fields, the related columns in the silver table remain empty. As a result, the following columns in the ImagingStudy table contain null values: implicitRules, language, text, contained, encounter, basedOn, referrer, interpreter, endpoint, procedureReference, procedureCode, location, reasonCode, reasonReference, and note.
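
As a rough illustration of the ImagingStudy mapping, the sketch below builds the `id`, `identifier`, and `started` values from one bronze row in plain Python. It isn't the pipeline's code: the function names are hypothetical, `uuid.uuid4()` is an assumption (the table only says the Python UUID module is used), and `concat_ws` is emulated with a helper that, like the Spark function, skips null inputs.

```python
import uuid


def concat_ws(sep, *values):
    """Join non-null values, mirroring how Spark's concat_ws skips nulls."""
    return sep.join(str(v) for v in values if v is not None)


def to_imaging_study(row):
    """Map a bronze ImagingDicom row (dict) to a few ImagingStudy fields."""
    return {
        "id": str(uuid.uuid4()),  # assumption: random (version 4) GUID
        "resourceType": "ImagingStudy",
        "status": "available",
        "identifier": [
            {"system": "urn:dicom:uid", "value": row["studyInstanceUid"]},
            {
                "type": {"coding": [{
                    "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
                    "code": "ACSN",
                }]},
                "value": row.get("accessionNumber"),
            },
        ],
        "started": concat_ws(
            " ",
            row.get("studyDate"),
            row.get("studyTime"),
            row.get("timezoneOffsetFromUtc"),
        ),
        "description": row.get("studyDescription"),
    }


study = to_imaging_study({
    "studyInstanceUid": "1.2.840.1",
    "accessionNumber": "ACC-1",
    "studyDate": "20241130",
    "studyTime": "101500",
    "timezoneOffsetFromUtc": None,
})
```

Because null inputs are skipped, a missing `timezoneOffsetFromUtc` yields a `started` value built from the date and time alone instead of a null result.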

Transformation mapping for silver to gold delta table

The following table explains the complete mapping for the transformation of DICOM data in the silver lakehouse ImagingStudy delta table to the Observational Medical Outcomes Partnership (OMOP) Image_Occurrence delta table in the gold lakehouse.

Source column in ImagingStudy	Destination column in OMOP Image_Occurrence	Data type	Mapping details
`series.started`	`image_occurrence_date`	date	Imaging procedure (series) occurrence date.
`series.modality` (combination of `series.modality.code` and `series.modality.system`)	`modality_concept_id`	string	`concat_ws('<->', exp_series.modality.code, exp_series.modality.system)`
NA	`SourceTable`	string	`'ImagingStudy_FHIR'`
`id`	`msftSourceRecordId`	string	System generated ID of the source record.
`identifier['studyInstanceUid']`	`image_study_uid`	string	DICOM Study UID.
`subject`	`person_id`	integer	Person ID of the person associated with the recorded procedure.
An array of dictionary values, where the key is `instance.uid` and value is `instance.extension[0].valueUrl`	`local_path`	string	`to_json(transform(exp_series.instance, x -> map('instanceid', x.uid, 'local_path', from_json(x.extension, 'array<struct<valueUrl:string,url:string>>')[0].valueUrl)))`
NA	`SourceModifiedOn`	datetime	Record modification date.
`resourceType`	`msftSourceTableName`	string	`'Imaging Study'`
`msftModifiedDatetime`	`msftModifiedDatetime`	datetime	Direct mapping with a one-to-one relationship.
`series.uid`	`image_occurrence_id`	string	Unique key given to an imaging study record.
`series.modality.code`	`modality_source_value`	string	Modality of the series.

Note

Some fields in the gold table are generated to align with the OMOP Image_Occurrence schema. However, since the bronze layer doesn't extract data that accurately corresponds to these fields, the related columns in the gold table remain empty. As a result, the following columns in the Image_Occurrence table contain null values: visit_occurrence_id, procedure_occurrence_id, and anatomic_site_concept_id.
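
The two computed gold columns can be approximated outside Spark. The following sketch emulates the `concat_ws('<->', ...)` and `to_json(transform(...))` expressions from the table for a single exploded series. The function name `to_image_occurrence` and the sample values are hypothetical, the sketch assumes `extension` arrives as a JSON string (as the `from_json` call suggests), and the exact JSON spacing that Spark emits may differ.

```python
import json


def to_image_occurrence(series):
    """Derive a few Image_Occurrence values from one series (dict)."""
    modality = series.get("modality") or {}
    parts = [modality.get("code"), modality.get("system")]
    local_path = [
        {
            "instanceid": inst["uid"],
            # Parse the extension JSON string and take the first entry's valueUrl.
            "local_path": json.loads(inst["extension"])[0]["valueUrl"],
        }
        for inst in series.get("instance", [])
    ]
    return {
        "image_occurrence_id": series["uid"],
        "modality_source_value": modality.get("code"),
        # concat_ws skips nulls, so only non-null parts are joined.
        "modality_concept_id": "<->".join(p for p in parts if p is not None),
        "local_path": json.dumps(local_path),
    }


occurrence = to_image_occurrence({
    "uid": "1.2.840.1.1",
    "modality": {"code": "CT", "system": "https://dicom.nema.org/resources/ontology/DCM"},
    "instance": [{
        "uid": "1.2.840.1.1.1",
        "extension": json.dumps([{"url": "file_path", "valueUrl": "abfss://example/a.dcm"}]),
    }],
})
```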
