DICOM metadata transformation mapping in healthcare data solutions

This article explains how the healthcare data solutions environment extracts and transforms DICOM metadata across different lakehouse levels. You can also learn about the end-to-end metadata transformation process and understand the transformation mapping at each level.

The metadata transformation through the ingestion pipeline consists of the following three consecutive stages:

  1. Extraction and transformation of DICOM metadata to bronze delta table
  2. Metadata transformation from bronze to silver delta table
  3. Metadata transformation from silver to gold delta table

The following sections detail the transformation mapping for each stage.

Transformation mapping for DICOM metadata to bronze delta table

There are more than 5,000 DICOM tags defined by the DICOM standard, including vendor-specific private tags. This section identifies which tags are retrieved and explains the extraction process in the bronze lakehouse.

The tag extraction and ImagingDicom delta table creation process includes the following actions:

  1. Extraction from DICOM files: Extract a collection of all the tags from the DICOM (DCM) files in the optimized folder structure in the bronze lakehouse.

  2. Pixel data tag exclusion: Exclude the DICOM pixel data tag (7FE0,0010) and the image pixel data module attributes from the collection. The DICOM pixel data tag includes image/pixel-level details.

  3. JSON mapping: Map all the extracted DICOM tags into a JSON structure of key-value pairs in the following schema:

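    from pyspark.sql.types import (
        ArrayType, MapType, StringType, StructField, StructType
    )

    # Map of DICOM tag (string key) to a struct that holds the
    # value representation ("vr") and the list of values ("Value").
    METADATA_JSON_DICT_SCHEMA = MapType(
        StringType(),
        StructType([
            StructField("vr", StringType(), True),
            StructField("Value", ArrayType(StringType(), True), True),
        ]),
    )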
    

    These key-value JSON pairs are written to the metadata column in the bronze lakehouse ImagingDicom delta table.

    Note

    The metadata_string column also stores the metadata as a string because Fabric SQL endpoints don't support complex data types such as structs, arrays, and maps. You can query these columns as strings using the SQL endpoint (T-SQL) or work with their native types (structs, arrays, maps) using Spark.

  4. Extraction and mapping to bronze lakehouse: Further extract the following 29 DICOM tags and write them to the respective destination columns in the ImagingDicom delta table:

    | Source DICOM tag | Destination column | Required |
    |---|---|---|
    | (0020,000D) | studyInstanceUid | Yes |
    | (0010,0010) | patientName | No |
    | (0010,0040) | patientSex | No |
    | (0010,0020) | patientId | Yes |
    | (0010,0030) | patientBirthDate | No |
    | (0008,0050) | accessionNumber | Yes |
    | (0008,0090) | referringPhysicianName | Yes |
    | (0008,0020) | studyDate | Yes |
    | (0008,1030) | studyDescription | Yes |
    | (0020,000E) | seriesInstanceUid | Yes |
    | (0008,0060) | modality | Yes |
    | (0008,0061) | modalitiesInStudy | Yes |
    | (0040,0244) | performedProcedureStepStartDate | No |
    | (0008,1090) | manufacturerModelName | No |
    | (0008,0018) | sopInstanceUid | Yes |
    | (0008,0030) | studyTime | Yes |
    | (0008,0201) | timezoneOffsetFromUtc | Yes |
    | (0020,1206) | numberOfStudyRelatedSeries | Yes |
    | (0020,1208) | numberOfStudyRelatedInstances | Yes |
    | (0020,0011) | seriesNumber | Yes |
    | (0008,103E) | seriesDescription | Yes |
    | (0020,1209) | numberOfSeriesRelatedInstances | Yes |
    | (0018,0015) | bodyPartExamined | Yes |
    | (0020,0060) | laterality | Yes |
    | (0008,0021) | seriesDate | Yes |
    | (0008,0031) | seriesTime | Yes |
    | (0008,0016) | sopClassUid | Yes |
    | (0020,0013) | instanceNumber | Yes |
    | (0042,0010) | documentTitle | Yes |

    Note

    • For more information about why we promote these particular 29 DICOM tags, see DICOM tag extraction.

    • To learn more about the ingestion pattern (append), go to Append pattern in the bronze lakehouse.

    • The modalitiesInStudy_string column also stores the modalitiesInStudy tag as a string because Fabric SQL endpoints don't support complex data types such as structs, arrays, and maps. You can query these columns as strings using the SQL endpoint (T-SQL) or work with their native types (structs, arrays, maps) using Spark.

  5. DCM file path storage: The full file path for the DCM file is written to the filePath column in the ImagingDicom delta table.

  6. Modification time logging: The latest timestamp at which the DCM file was modified at its source is written to the sourceModifiedAt column in the ImagingDicom delta table.

  7. Namespace storage: The namespace value is written to the sourceSystem column in the ImagingDicom delta table. This value derives from the folder name in the unified folder structure.

    • For regular ingestion, the namespace value is the folder name after Files\Process\Imaging\DICOM.

    • For Bring Your Own Storage (BYOS) ingestion, the namespace value is the folder name after Files\External\Imaging\DICOM.

  8. Execution time logging: The notebook's execution date and time are written to the createdDatetime column in the ImagingDicom delta table.
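The actions above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: it assumes the tags are already parsed into a dictionary keyed by tag keyword with `vr` and `Value` entries, and the helper names (`namespace_from_path`, `build_bronze_row`), the `PixelData` key, and the sample file path are hypothetical.

```python
import json
from datetime import datetime, timezone

# Roots named in the article; the folder that follows is the namespace.
NAMESPACE_ROOTS = (
    "Files/Process/Imaging/DICOM",
    "Files/External/Imaging/DICOM",  # Bring Your Own Storage (BYOS)
)

def namespace_from_path(file_path):
    """Return the folder name that follows a known root, or None."""
    normalized = file_path.replace("\\", "/")
    for root in NAMESPACE_ROOTS:
        marker = root + "/"
        index = normalized.find(marker)
        if index != -1:
            parts = normalized[index + len(marker):].split("/")
            # A folder must be followed by at least one more path segment.
            return parts[0] if len(parts) > 1 and parts[0] else None
    return None

def build_bronze_row(file_path, tags, source_modified_at):
    """Assemble a dict shaped like a subset of one ImagingDicom row."""
    # Pixel data exclusion; the key name "PixelData" is an assumption.
    metadata = {k: v for k, v in tags.items() if k != "PixelData"}

    def first_value(keyword):
        values = metadata.get(keyword, {}).get("Value") or [None]
        return values[0]

    return {
        "metadata": metadata,
        # String copy of the same metadata for SQL endpoint queries.
        "metadata_string": json.dumps(metadata, sort_keys=True),
        "studyInstanceUid": first_value("StudyInstanceUID"),
        "modality": first_value("Modality"),
        "filePath": file_path,
        "sourceModifiedAt": source_modified_at,
        "sourceSystem": namespace_from_path(file_path),
        "createdDatetime": datetime.now(timezone.utc),
    }

# Hypothetical sample input.
sample_tags = {
    "StudyInstanceUID": {"vr": "UI", "Value": ["1.2.3"]},
    "Modality": {"vr": "CS", "Value": ["CT"]},
    "PixelData": {"vr": "OW", "Value": ["..."]},
}
row = build_bronze_row(
    r"Files\Process\Imaging\DICOM\MyPACSsystem\scan.dcm",
    sample_tags,
    datetime(2024, 1, 15, tzinfo=timezone.utc),
)
```

In this sketch, `row["sourceSystem"]` resolves to the folder that follows the ingestion root, and `row["metadata"]` no longer contains the pixel data entry.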

Transformation mapping for bronze to silver delta table

The following tables explain the complete mapping for the transformation of DICOM metadata from the ImagingDicom delta table in the bronze lakehouse to the ImagingMetastore and ImagingStudy delta tables in the silver lakehouse. The ImagingMetastore delta table stores the DICOM tags for each DCM file as JSON key-value pairs within the metadata columns. Copying all the metadata from the bronze layer to the silver layer preserves data integrity across layers. The ImagingStudy delta table includes the 29 DICOM tags selected for alignment with FHIR standard fields, along with additional fields that support data tracking and lineage.

| Source column in ImagingDicom | Destination column in ImagingMetastore | Mapping details |
|---|---|---|
| NA | msftModifiedDatetime | Included through the common delta merge logic applied to all tables in the silver layer. |
| studyInstanceUid | studyInstanceUid | Direct mapping with a one-to-one relationship. Each value in the source column maps directly to a single corresponding value in the destination. |
| seriesInstanceUid | seriesInstanceUid | Direct mapping with a one-to-one relationship. |
| sopInstanceUid | sopInstanceUid | Direct mapping with a one-to-one relationship. |
| sourceSystem | msftSourceSystem | Direct mapping with a one-to-one relationship. |
| metadata | metadata | Direct mapping with a one-to-one relationship. |
| metadata_string | metadata_string | Direct mapping with a one-to-one relationship. |
| filePath | filePath | Direct mapping with a one-to-one relationship. |
| sourceModifiedAt | sourceModifiedAt | Direct mapping with a one-to-one relationship. |
| NA | id | A GUID generated using the Python UUID module. |
| NA | msftCreatedDatetime | Included through the common delta merge logic applied to all tables in the silver layer. |

| Source column in ImagingDicom | Destination column in ImagingStudy | Mapping details |
|---|---|---|
| NA | msftModifiedDatetime | Included through the common delta merge logic applied to all tables in the silver layer. |
| NA | id | A GUID generated using the Python UUID module. |
| NA | resourceType | "ImagingStudy" |
| sourceSystem | msftSourceSystem | Not a direct mapping. The DICOM data transformation capability uses the sourceSystem column in the bronze lakehouse to create the Namespace folder when writing the generated NDJSON files to the Process folder. To learn more about the Namespace folder, see Unified folder structure: Folder descriptions. At this stage, the clinical bronze ingestion service uses the Namespace folder name to populate the msftSourceSystem column in the silver lakehouse. For example, if the sourceSystem value is defined as MyPACSsystem in the bronze ImagingDicom table, the imaging bronze ingestion service writes the newly created NDJSON files to the following folder structure: Process\Clinical\FHIR-NDJSON\MyPACSsystem\YYYY\MM\DD\ImagingStudy-<timestamp>.ndjson. When the clinical bronze ingestion picks up these files, it automatically populates the msftSourceSystem column with MyPACSsystem from the folder structure and propagates the same value to the silver layer. |
| NA | msftFilePath | File path to the generated ImagingStudy NDJSON in the Process\Clinical\FHIR-NDJSON\DICOM-HDS folder. |
| filePath | extension | "extension": [{"url": "lit('file_path')", "valueUrl": "col('FilePath')"}] The value for FilePath includes the ABFS file path in OneLake for all instance-level DCM files that are part of this ImagingStudy. |
| NA | meta | "meta": {"lastUpdated": "current_timestamp()"} |
| studyInstanceUid, accessionNumber | identifier | ImagingStudy.identifier.where(system = 'urn:dicom:uid') => StudyInstanceUID; ImagingStudy.identifier.where(type.coding.system = 'http://terminology.hl7.org/CodeSystem/v2-0203' and type.coding.code = 'ACSN') => "AccessionNumber" |
| NA | status | "available" |
| modalitiesInStudy | modality | modality = List{code = col('ModalitiesInStudy')} |
| patientId | subject | "subject": {"identifier": {"type": {"coding": [{"system": "lit('http://terminology.hl7.org/CodeSystem/v2-0203')", "code": "lit('MR')"}]}, "value": "col('PatientID')"}, "type": "lit('Patient')"} |
| patientName, patientBirthDate, patientSex | subject | "subject": {"extension": [{"url": "lit('name')", "valueString": "col('PatientName')"}, {"url": "lit('birthDate')", "valueDateTime": "col('PatientBirthDate')"}, {"url": "lit('gender')", "valueCode": "col('PatientSex')"}]} |
| studyDate, studyTime, timezoneOffsetFromUtc | started | concat_ws(' ', col('StudyDate'), col('StudyTime'), col('TimezoneOffsetFromUTC')) |
| numberOfStudyRelatedSeries | numberOfSeries | col('NumberOfStudyRelatedSeries') |
| numberOfStudyRelatedInstances | numberOfInstances | col('NumberOfStudyRelatedInstances') |
| studyDescription | description | col('StudyDescription') |
| seriesInstanceUid, seriesDate, seriesTime, timezoneOffsetFromUtc, modality, laterality, bodyPartExamined, numberOfSeriesRelatedInstances, seriesDescription, seriesNumber, sopInstanceUid, sopClassUid, instanceNumber, documentTitle | series | {"series": [{"uid": "col('SeriesInstanceUID')", "started": {"tag": "SeriesDate,SeriesTime,TimezoneOffsetFromUTC", "calc": "concat_ws(' ', col('SeriesDate'), col('SeriesTime'), col('TimezoneOffsetFromUTC')).cast(TimestampType())"}, "modality": {"code": "col('Modality')", "system": "lit('https://dicom.nema.org/resources/ontology/DCM')"}, "laterality": {"display": "col('Laterality')"}, "bodySite": {"display": "col('BodyPartExamined')"}, "numberOfInstances": "col('NumberOfSeriesRelatedInstances')", "description": "col('SeriesDescription')", "number": "col('SeriesNumber')", "instance": [{"uid": "col('SOPInstanceUID')", "sopClass": {"code": "col('SOPClassUID')"}, "number": "col('InstanceNumber')", "title": "col('DocumentTitle')", "extension": [{"url": "lit('file_path')", "valueUrl": "col('FilePath')"}]}]}]} |
| NA | meta.lastupdated | current_timestamp() |
| NA | msftCreatedDatetime | Included through the common delta merge logic applied to all tables in the silver layer. |
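To make a few of the ImagingStudy mappings concrete, the following sketch derives the id, identifier, and started fields from a bronze row in plain Python. It's an illustration only: the `to_imaging_study` helper, the sample values, and the choice of `uuid4` are assumptions, and `concat_ws` here is a small stand-in that mirrors how the Spark function of the same name skips null inputs.

```python
import uuid

V2_0203 = "http://terminology.hl7.org/CodeSystem/v2-0203"

def concat_ws(separator, *values):
    """Join non-null values, mirroring how Spark's concat_ws skips nulls."""
    return separator.join(str(v) for v in values if v is not None)

def to_imaging_study(bronze_row):
    """Build a partial FHIR-style ImagingStudy dict from a bronze row dict."""
    return {
        "id": str(uuid.uuid4()),  # the UUID version is an assumption
        "resourceType": "ImagingStudy",
        "status": "available",
        "identifier": [
            # Study instance UID identifier.
            {"system": "urn:dicom:uid", "value": bronze_row["studyInstanceUid"]},
            # Accession number identifier, typed with the ACSN code.
            {
                "type": {"coding": [{"system": V2_0203, "code": "ACSN"}]},
                "value": bronze_row["accessionNumber"],
            },
        ],
        "started": concat_ws(
            " ",
            bronze_row.get("studyDate"),
            bronze_row.get("studyTime"),
            bronze_row.get("timezoneOffsetFromUtc"),
        ),
        "numberOfSeries": bronze_row.get("numberOfStudyRelatedSeries"),
        "description": bronze_row.get("studyDescription"),
    }

# Hypothetical bronze row with a missing time zone offset.
study = to_imaging_study({
    "studyInstanceUid": "1.2.3",
    "accessionNumber": "ACC-001",
    "studyDate": "20240115",
    "studyTime": "101500",
    "timezoneOffsetFromUtc": None,
    "numberOfStudyRelatedSeries": 2,
    "studyDescription": "Chest CT",
})
print(study["started"])  # 20240115 101500
```

Because null inputs are skipped, a missing time zone offset leaves no trailing separator in the started value.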

Note

  • Columns with the suffix Orig are created in the silver lakehouse to store original values of fields sourced from the bronze layer. This standard practice includes the following columns in the ImagingStudy table: meta_lastUpdatedOrig, identifierOrig, idOrig, and startedOrig.

  • Columns with the _string suffix store stringified versions of fields containing complex JSON data, enabling querying through the SQL analytics endpoint. This practice applies across all tables in the silver lakehouse and includes the following columns in the ImagingStudy table: meta_string, text_string, contained_string, identifier_string, modality_string, subject_string, encounter_string, basedOn_string, referrer_string, interpreter_string, endpoint_string, procedureReference_string, procedureCode_string, location_string, reasonCode_string, reasonReference_string, note_string, series_string, and identifierOrig_string.

  • Some fields in the ImagingStudy table are generated to align with the FHIR ImagingStudy schema. However, since the bronze layer doesn't extract data from the DCM files that accurately corresponds to these fields, the related columns in the silver table remain empty. As a result, the following columns in the ImagingStudy table contain null values: implicitRules, language, text, contained, encounter, basedOn, referrer, interpreter, endpoint, procedureReference, procedureCode, location, reasonCode, reasonReference, and note.
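The _string convention described in the preceding notes can be illustrated with a short sketch. It assumes that a complex field is any value represented as a dictionary or list and that the string form is JSON. The `add_string_columns` helper and the sample record are hypothetical.

```python
import json

def add_string_columns(record):
    """Return a copy of record with a <name>_string companion per complex field."""
    result = dict(record)
    for name, value in record.items():
        if isinstance(value, (dict, list)):
            # Compact JSON text that a SQL endpoint can read as a plain string.
            result[name + "_string"] = json.dumps(value, separators=(",", ":"))
    return result

# Hypothetical silver record with one scalar and two complex fields.
silver_record = {
    "status": "available",
    "modality": [{"code": "CT"}],
    "meta": {"lastUpdated": "2024-01-15T10:15:00Z"},
}
flattened = add_string_columns(silver_record)
print(flattened["modality_string"])  # [{"code":"CT"}]
```

The original complex values stay in place for Spark, while scalar fields such as status get no companion column.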

Transformation mapping for silver to gold delta table

The following table explains the complete mapping for the transformation of DICOM data in the silver lakehouse ImagingStudy delta table to the Observational Medical Outcomes Partnership (OMOP) Image_Occurrence delta table in the gold lakehouse.

| Source column in ImagingStudy | Destination column in OMOP Image_Occurrence | Data type | Mapping details |
|---|---|---|---|
| series.started | image_occurrence_date | date | Imaging procedure (series) occurrence date. |
| series.modality (combination of series.modality.code and series.modality.system) | modality_concept_id | string | concat_ws('<->', exp_series.modality.code, exp_series.modality.system) |
| NA | SourceTable | string | 'ImagingStudy_FHIR' |
| id | msftSourceRecordId | string | System generated ID of the source record. |
| identifier['studyInstanceUid'] | image_study_uid | string | DICOM Study UID. |
| subject | person_id | integer | Person ID of the person associated with the recorded procedure. |
| An array of dictionary values, where the key is instance.uid and value is instance.extension[0].valueUrl | local_path | string | to_json(transform(exp_series.instance, x -> map('instanceid', x.uid, 'local_path', from_json(x.extension, 'array<struct<valueUrl:string,url:string>>')[0].valueUrl))) |
| NA | SourceModifiedOn | datetime | Record modification date. |
| resourceType | msftSourceTableName | string | 'Imaging Study' |
| msftModifiedDatetime | msftModifiedDatetime | datetime | Direct mapping with a one-to-one relationship. |
| series.uid | image_occurrence_id | string | Unique key given to an imaging study record. |
| series.modality.code | modality_source_value | string | Modality of the series. |
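The two expression-based mappings in the table, modality_concept_id and local_path, can be read as the following plain Python sketch. It's an approximation of the Spark SQL expressions rather than the pipeline's code: the function names, the sample series, and the sample ABFS path are hypothetical, and the extension field is assumed to arrive as a JSON string because the Spark expression parses it with from_json.

```python
import json

def modality_concept_id(series):
    """Mirror concat_ws('<->', modality.code, modality.system), skipping nulls."""
    modality = series.get("modality") or {}
    parts = [modality.get("code"), modality.get("system")]
    return "<->".join(p for p in parts if p is not None)

def local_path(series):
    """Serialize one {'instanceid', 'local_path'} entry per instance."""
    entries = []
    for instance in series.get("instance", []):
        # Parse the extension JSON string, then take the first entry's URL.
        extension = json.loads(instance["extension"])
        entries.append({
            "instanceid": instance["uid"],
            "local_path": extension[0]["valueUrl"],
        })
    return json.dumps(entries)

# Hypothetical exploded series record.
exp_series = {
    "uid": "1.2.3.4",
    "modality": {
        "code": "CT",
        "system": "https://dicom.nema.org/resources/ontology/DCM",
    },
    "instance": [
        {
            "uid": "1.2.3.4.1",
            "extension": json.dumps(
                [{"url": "file_path", "valueUrl": "abfss://example/scan.dcm"}]
            ),
        }
    ],
}
concept = modality_concept_id(exp_series)
paths = local_path(exp_series)
```

Here, `concept` joins the modality code and system with the `<->` separator, and `paths` is a JSON array string with one entry per instance in the series.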

Note

Some fields in the gold table are generated to align with the OMOP Image_Occurrence schema. However, since the bronze layer doesn't extract data that accurately corresponds to these fields, the related columns in the gold table remain empty. As a result, the following columns in the Image_Occurrence table contain null values: visit_occurrence_id, procedure_occurrence_id, and anatomic_site_concept_id.