Ingest clinical data using healthcare data foundations
The clinical transformation capability deploys as part of the healthcare data foundations. This capability provides ready-to-run data pipelines that efficiently prepare data for analytics and AI/machine learning modeling.
For more information on the deployment and the available artifacts, see:
The deployment creates three lakehouses, five notebooks, a Fabric environment, and a clinical data pipeline in your healthcare data solutions environment. This data pipeline ingests clinical data and transforms it from the raw source files into the bronze and silver lakehouses. As explained in Data ingestion patterns, it supports two ingestion patterns: Ingest and Bring Your Own Storage (BYOS). The BYOS ingestion pipeline run is explained in Use Azure Health Data Services - Data export. This article outlines how to use the Ingest pattern to process the clinical sample data provided with healthcare data solutions.
Note
You can also use your own FHIR dataset instead of the clinical sample dataset. However, review the considerations in Usage considerations before doing so.
Prerequisites
- Deploy healthcare data solutions in Microsoft Fabric.
- Install the foundational notebooks and pipelines as explained in Deploy healthcare data foundations.
- Deploy the clinical sample data as explained in Deploy sample data.
Move the clinical sample data to the ingestion folder
When you deploy the sample data as explained in Deploy sample data, the clinical sample data files should be available in the unified folder structure under Files\SampleData\Clinical\FHIR-NDJSON\FHIR-HDS\51KSyntheticPatients in the bronze lakehouse. Use OneLake or Azure Storage Explorer to copy the 51KSyntheticPatients files from Files\SampleData\Clinical\FHIR-NDJSON\FHIR-HDS to Files\Ingest\Clinical\FHIR-NDJSON\FHIR-HDS in the bronze lakehouse.
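If you prefer to script the copy, the following minimal Python sketch shows one way to do it. It assumes the bronze lakehouse Files folder is reachable as a local path (for example, through a synced or mounted OneLake folder). The function name copy_sample_to_ingest and the files_root parameter are hypothetical and aren't part of healthcare data solutions.

```python
import shutil
from pathlib import Path

# Folder layout shared by the SampleData and Ingest areas, as described above.
CLINICAL_SUBPATH = Path("Clinical") / "FHIR-NDJSON" / "FHIR-HDS"


def copy_sample_to_ingest(files_root: Path, dataset: str = "51KSyntheticPatients") -> Path:
    """Copy the sample dataset folder from SampleData to Ingest.

    files_root is an assumption: a local path that maps to the Files
    folder of the bronze lakehouse.
    """
    source = files_root / "SampleData" / CLINICAL_SUBPATH / dataset
    target = files_root / "Ingest" / CLINICAL_SUBPATH / dataset
    if not source.is_dir():
        raise FileNotFoundError(f"Sample data folder not found: {source}")
    # copytree creates any missing parent folders; dirs_exist_ok (Python 3.8+)
    # lets the copy succeed when the target folder already exists.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

The sample files stay in SampleData after the copy, so you can repeat the exercise later without redeploying the sample data.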
Run the data pipeline
Run the healthcare#_msft_clinical_data_foundation_ingestion data pipeline in the bronze lakehouse. Depending on the clinical sample data size and the Fabric capacity assigned to the workspace, the pipeline execution should complete in an hour. After the pipeline run finishes, you can see that the pipeline ran successfully on the sample data but logged a Failed status for the fhir_ingestion_bronze_ingestion notebook activity.
Validate the data
In real-world scenarios, you'll ingest data from various sources with different levels of quality. The validation engine, introduced in Data validation, intentionally triggers validations on some of the provided clinical sample data. During pipeline execution, the ingestion activity fails because some of the sample data is intentionally invalid. The failed files aren't processed and are moved to the Failed folder. All other valid files are processed successfully, resulting in an overall green/successful pipeline status.
To investigate the failure, select the icon next to the Failed status under activity status. It provides information on how to locate the error details, along with a sample SQL query based on the runId value in the admin lakehouse BusinessEvents table. Seven errors appear for this runId, all due to "Last Updated does not exist". The corresponding failed NDJSON file resides in the Failed folder, with the sourceFilePath pointing to …/Files/Failed/Clinical/FHIR-NDJSON/FHIR-HDS/2024/10/18/51KSyntheticPatients/1729215337.346439_RiskAssessment.ndjson.zip.
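When you review several failures, it can help to break a sourceFilePath value into its parts. The following Python sketch is illustrative only. It assumes the Failed folder layout shown in this article (YYYY/MM/DD followed by the dataset folder) and assumes the file name ends in _<ResourceType>.ndjson, optionally followed by .zip. The FailedFile type and the parse_failed_path function are hypothetical helpers, not part of healthcare data solutions.

```python
import re
from datetime import date
from typing import NamedTuple


class FailedFile(NamedTuple):
    folder_date: date    # date taken from the YYYY/MM/DD folders
    dataset: str         # for example, 51KSyntheticPatients
    file_name: str       # full file name, including extensions
    resource_type: str   # assumed to follow the last underscore


# Layout assumption: Files/Failed/Clinical/FHIR-NDJSON/FHIR-HDS/YYYY/MM/DD/<dataset>/<file>
_FAILED_PATH = re.compile(
    r"Files/Failed/Clinical/FHIR-NDJSON/FHIR-HDS/"
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
    r"(?P<dataset>[^/]+)/(?P<file>[^/]+)$"
)


def parse_failed_path(source_file_path: str) -> FailedFile:
    """Split a sourceFilePath value from BusinessEvents into its parts."""
    normalized = source_file_path.replace("\\", "/")
    match = _FAILED_PATH.search(normalized)
    if match is None:
        raise ValueError(f"Not a recognized Failed folder path: {source_file_path}")
    file_name = match["file"]
    # Naming assumption: "<prefix>_<ResourceType>.ndjson[.zip]"
    stem = file_name.split(".ndjson")[0]
    resource_type = stem.rsplit("_", 1)[-1]
    folder_date = date(int(match["year"]), int(match["month"]), int(match["day"]))
    return FailedFile(folder_date, match["dataset"], file_name, resource_type)
```

For the path shown earlier, this sketch reports RiskAssessment as the resource type, which is consistent with the empty RiskAssessment table in the silver lakehouse.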
The successfully processed files leave the Ingest folder (now empty) and move to the Process folder.
You can also explore the ingested data in the bronze lakehouse ClinicalFhir table and the respective FHIR tables in the healthcare data model in the silver lakehouse. Here's a summary of the expected record counts:
Admin lakehouse:
- BusinessEvents table: Seven records
Bronze lakehouse:
- ClinicalFhir table: 33,317,250 records
- Files\Ingest\Clinical\FHIR-NDJSON\FHIR-HDS\51KSyntheticPatients: No files
- Files\Process\Clinical\FHIR-NDJSON\FHIR-HDS\51KSyntheticPatients\YYYY\MM\DD: 67 files
- Files\Failed\Clinical\FHIR-NDJSON\FHIR-HDS\YYYY\MM\DD\51KSyntheticPatients: One file
Silver lakehouse:
- Patient table: 47,564 records
- Observation table: 19,726,265 records
- RiskAssessment table: No records
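To compare your own run against these numbers, you can collect the table row counts by any means (for example, a COUNT(*) query per table) and check them with a small helper. The following Python sketch covers only the comparison. The EXPECTED_COUNTS mapping restates the table counts listed in this section, and find_count_mismatches is a hypothetical name.

```python
# Expected record counts for the clinical sample data, keyed by
# (lakehouse, table). Values restate the summary in this section.
EXPECTED_COUNTS = {
    ("admin", "BusinessEvents"): 7,
    ("bronze", "ClinicalFhir"): 33_317_250,
    ("silver", "Patient"): 47_564,
    ("silver", "Observation"): 19_726_265,
    ("silver", "RiskAssessment"): 0,
}


def find_count_mismatches(actual_counts: dict) -> dict:
    """Return {(lakehouse, table): (expected, actual)} for every mismatch.

    A table that is missing from actual_counts is reported with an
    actual value of None.
    """
    mismatches = {}
    for key, expected in EXPECTED_COUNTS.items():
        actual = actual_counts.get(key)
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches
```

An empty result means that every listed table matches the expected count. These counts apply to the provided sample data only and don't apply to your own FHIR dataset.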
Usage considerations
When ingesting FHIR datasets in healthcare data solutions in Microsoft Fabric, consider the following requirements:
- All data must use NDJSON format.
- Each file must only contain data for a single FHIR resource.
- Each resource in the file requires a metadata field with a valid value for Meta.LastUpdated. If this value isn't present, a default validation error occurs as explained in Data validation.
- Each resource in the file must have a value for the ID field. If this value isn't present, a default validation error occurs as explained in Data validation.
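Before you copy your own files to the Ingest folder, you can check them against these requirements locally. The following Python sketch is a rough pre-check and not the validation engine described in Data validation. It assumes standard FHIR JSON field names (resourceType, id, and meta.lastUpdated), and it only checks that meta.lastUpdated is present, not that its value is a valid timestamp. The function name check_ndjson_lines is hypothetical.

```python
import json
from typing import Iterable, List


def check_ndjson_lines(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found in the lines of one NDJSON file."""
    problems = []
    resource_types = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines, such as a trailing newline
        try:
            resource = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: not valid JSON ({exc.msg})")
            continue
        if not isinstance(resource, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        resource_types.add(resource.get("resourceType"))
        if not resource.get("id"):
            problems.append(f"line {number}: missing id")
        meta = resource.get("meta")
        if not isinstance(meta, dict) or not meta.get("lastUpdated"):
            problems.append(f"line {number}: missing meta.lastUpdated")
    if len(resource_types) > 1:
        names = ", ".join(sorted(map(str, resource_types)))
        problems.append(f"file mixes resource types: {names}")
    return problems
```

To check a file, open it in text mode and pass the file object to the function, for example check_ndjson_lines(open(path, encoding="utf-8")). An empty list means that no problems were found by this sketch.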