Ingest large-scale imaging data with inventory-based ingestion
Inventory-based ingestion lets you ingest large-scale imaging data. The initial data load is often substantial, sometimes hundreds of millions of records, and requires a processing method that avoids out-of-memory issues. The DICOM data transformation capability achieves large-scale ingestion by moving the Fabric Spark engine's file listing step outside the core processing logic, which avoids memory-intensive activities during pipeline execution.
You must provide the file listing information in a designated format within a separate file, called an inventory file. This information enables the capability to read and process the actual DCM files in manageable batches, supporting ingestion at scale.
To use inventory-based ingestion:
- Prepare the inventory files: Generate the inventory files with the prerequisite information. The inventory files must be in the parquet format.
- Enable inventory-based ingestion: Shortcut the inventory and DCM files, update the configuration, and run the data ingestion in your healthcare data solutions workspace.
The following sections outline the configuration and execution details.
Prepare the inventory files using Azure Storage blob inventory
Azure Storage blob inventory provides a list of the containers, blobs, blob versions, and snapshots in your storage account, along with their associated properties. It generates an output report in either comma-separated values (CSV) or Apache Parquet format. You can use the generated report to provide the prerequisite information in parquet format for inventory-based ingestion. To learn how to generate an Azure storage blob inventory report, see Enable inventory reports.
Ensure the inventory reports you generate contain at least the following fields:
- Name
- Last-Modified
- hdi_isfolder
Prepare the inventory files using other inventories
If your inventory files are in a non-Azure based storage, provide the following fields within the files:
- filePath: A string column containing the relative file path, starting from the container that you create a shortcut for in your healthcare data solutions lakehouse.
- sourceModifiedAt: A timestamp column containing the file modification date.
After generating the inventory reports with the prerequisite fields in your storage container, enable inventory-based ingestion in your healthcare data solutions workspace.
Enable inventory-based ingestion
After preparing the inventory files, deploy the DICOM data transformation capability and follow these steps to ingest data using inventory-based ingestion.
Step 1: Shortcut the inventory and DCM files
1. Go to the bronze lakehouse and select the ellipsis beside the file path Files\Inventory\Imaging\DICOM\DICOM-HDS\InventoryFiles.
2. Select New shortcut > Azure Data Lake Storage Gen2.
3. Select your inventory files and create the shortcut.
4. Select the ellipsis beside the folder path Files\Inventory\Imaging\DICOM\DICOM-HDS.
5. Select New shortcut > Azure Data Lake Storage Gen2.
6. Select the folder containing your DCM files and create the shortcut.
Step 2: Update the ingestion pattern parameters in the admin lakehouse
1. Go to the admin lakehouse.
2. Under Files\system-configurations, copy the contents of the deploymentParametersConfiguration.json file and create a local JSON file.
3. If using an Azure inventory, modify the following parameters in the local JSON file:
   - "ingestion_pattern": "2"
   - "move_failed_files": "false"
   - "compression_enabled": "false"
4. For a non-Azure inventory, modify the following parameters in the JSON file:
   - "ingestion_pattern": "2"
   - "azure_blob_storage_inventory": "false"
   - "move_failed_files": "false"
   - "compression_enabled": "false"
5. Name the JSON file deploymentParametersConfiguration.json.
6. Upload the modified JSON file to the original location in the admin lakehouse and overwrite the existing file.
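The parameter edits above can also be scripted against your local copy of the file. The sketch below assumes the parameters appear as top-level keys; in your actual deploymentParametersConfiguration.json they may be nested deeper, so adjust the lookup to match your copy (the stand-in file created here is purely illustrative):

```python
import json
import os
import tempfile

# Stand-in copy of deploymentParametersConfiguration.json; the real file
# may nest these keys, so adapt this sketch to its actual structure.
path = os.path.join(tempfile.mkdtemp(), "deploymentParametersConfiguration.json")
with open(path, "w") as f:
    json.dump({"ingestion_pattern": "1", "azure_blob_storage_inventory": "true",
               "move_failed_files": "true", "compression_enabled": "true"}, f)

with open(path) as f:
    config = json.load(f)

# Values for a non-Azure inventory; omit "azure_blob_storage_inventory"
# when using an Azure Storage blob inventory report.
config.update({
    "ingestion_pattern": "2",
    "azure_blob_storage_inventory": "false",
    "move_failed_files": "false",
    "compression_enabled": "false",
})

with open(path, "w") as f:
    json.dump(config, f, indent=2)
```

After editing, upload the result back to Files\system-configurations, keeping the original file name so it overwrites the existing file.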
Step 3: Run the imaging pipeline
1. Open the imaging pipeline healthcare#_msft_imaging_with_clinical_foundation_ingestion.
2. Optionally, select the healthcare#_msft_raw_process_movement notebook activity and set its activity state to Deactivated. This step skips the file movement from the Ingest folder to the Process folder, which isn't required for inventory-based ingestion.
3. Run the pipeline.
Incremental data load with inventory-based ingestion
You can also use inventory-based ingestion for loading incremental data, even for smaller data volumes. In this scenario, the DCM files are ingested incrementally based on the maximum sourceModifiedAt
value of the file for a given namespace and inventory file path. Because of this dependency, the ImagingDicom table in the bronze lakehouse should never be purged.
To incrementally load imaging data with inventory-based ingestion, follow these recommendations:
- Create a shortcut for the inventory files in the designated folder.
- Place only one set of inventory files at a time to prevent reingestion from multiple sets if there's a failure.
- To prevent processing inventory files from different sets simultaneously, set the maxFilesPerTrigger value to 1.
Note
If you use inventory-based ingestion for the initial data load and later switch to BYOS-based ingestion, carefully evaluate the impact before reverting to inventory-based ingestion. This caution is needed because inventory-based ingestion processes data incrementally based on the maximum sourceModifiedAt value in the bronze lakehouse. If an inventory file has a last modified date earlier than the source modified date generated from a BYOS-based ingestion, the file is skipped. To avoid this issue, make sure the last modified date is later than the one generated in the previous ingestion. Generally, we recommend following a consistent ingestion pattern.
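The skip behavior described in the note can be illustrated with a small watermark comparison. Everything below (paths, timestamps, variable names) is hypothetical and only mirrors the documented rule: files whose sourceModifiedAt is not later than the maximum value already ingested are skipped.

```python
from datetime import datetime, timezone

# Hypothetical watermark: the maximum sourceModifiedAt already recorded
# for this namespace and inventory file path in the bronze lakehouse.
watermark = datetime(2024, 5, 1, 10, 31, tzinfo=timezone.utc)

# Hypothetical inventory entries: (filePath, sourceModifiedAt).
inventory = [
    ("dicom/study-001/0002.dcm", datetime(2024, 5, 1, 10, 31, tzinfo=timezone.utc)),
    ("dicom/study-002/0001.dcm", datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc)),
]

# Only files strictly newer than the watermark are picked up.
picked_up = [path for path, ts in inventory if ts > watermark]
skipped = [path for path, ts in inventory if ts <= watermark]
```

This is why a file whose last modified date predates the watermark left behind by a BYOS-based ingestion is silently skipped on the next inventory-based run.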