Prepare the public datasets in SDOH datasets - Transformations (preview)

[This article is prerelease documentation and is subject to change.]

SDOH public datasets contain aggregated social determinants of health (SDOH) data published by government agencies and other official sources such as universities. These datasets consolidate various SDOH parameters at geographic levels such as state, county, or zip code. SDOH datasets - Transformations (preview) allows you to ingest these geography-level datasets in CSV (Comma-separated values) or XLSX (Excel Open XML Spreadsheet) format and normalize them into a custom data model.

The preview release provides the following eight sample SDOH datasets from various SDOH domains to help you run data pipelines and explore data transformations through the bronze, silver, and gold lakehouse layers:

  • USDA's Food Environment Atlas: Includes factors such as store/restaurant proximity, food prices, nutrition assistance programs, and community characteristics. These factors affect food choices, diet quality, and ultimately, health outcomes.

  • USDA's Rural Atlas: Offers statistics on socio-economic factors such as people, jobs, county classifications, income, and veterans.

  • AHRQ's SDOH Data: Provides details across five key SDOH domains:

    • Social context, such as age, race/ethnicity, veteran status.
    • Economic context, such as income, unemployment rate.
    • Education
    • Physical infrastructure, such as housing, crime, transportation.
    • Healthcare context, such as health insurance.
  • Location Affordability Index: Estimates household housing and transportation costs at the neighborhood level.

  • Environmental Justice Index: Aggregates data from multiple sources to rank the cumulative impacts of environmental injustice on health for every census tract.

  • ACS Education Attainment: Provides education insights for geographic areas, derived from a large, ongoing demographic survey.

  • Australian SEIFA: Combines Australian census data such as income, education, employment, and housing to summarize an area's socio-economic characteristics.

  • U.K. Indices of Deprivation: A widely used socio-economic measure within the United Kingdom to assess poverty across small areas, covering various dimensions.

Where:

  • USDA: United States Department of Agriculture
  • AHRQ: Agency for Healthcare Research and Quality
  • ACS: American Community Survey
  • SEIFA: Socio-Economic Indexes for Areas

Important

These datasets aren't just samples but complete, real datasets published by the respective organizations. They provide an accurate representation of the SDOH profiles of their geographic areas. Be careful when modifying them, as they are official publications from federal agencies.

Folder structure

The landing zone for SDOH datasets - Transformations (preview) comprises three folders: Ingest, Process, and Failed. To learn more about these folders, see Unified folder structure.

Prepare the SDOH datasets before ingestion

Before ingesting SDOH public datasets, ensure they're ready for successful ingestion. The following sections outline two scenarios:

  • Use your own dataset
  • Use the sample dataset

Use your own dataset

SDOH public datasets vary significantly across publishing organizations in format, volume, and structure. They lack an established standard for collecting and exchanging the captured information. Therefore, unifying them into a common shape is essential before representing them within a data model.

To ingest and transform an SDOH public dataset of your choice, add the following three key pieces of information to them:

  • Layout: Due to the absence of a standard set of codes for capturing SDOH data, understanding the meaning of each field is challenging. To resolve this issue, create a data dictionary for the dataset by adding a new sheet named Layout (if your dataset is in XLSX format) or create a new CSV file (if your dataset is in CSV format) with the columns displayed in the following example:

    A screenshot displaying a sample layout sheet.

  • DataSetMetadata: As SDOH datasets come from different publishers, recording key details about the dataset is crucial. Add a new sheet named DataSetMetadata (if your dataset is in XLSX format) or create a new CSV file (if your dataset is in CSV format) with the columns displayed in the following example:

    A screenshot displaying a sample dataset metadata sheet.

  • LocationConfiguration: Different geographies define and organize location data in various ways. To help the SDOH pipelines understand your dataset's geographical structure, add a new sheet named LocationConfiguration (if your dataset is in XLSX format) or create a new CSV file (if your dataset is in CSV format) with the columns displayed in the following example:

    A screenshot displaying a sample location configuration sheet.

Also:

  • You can refer to the structure of the sample SDOH datasets to populate requisite information such as social determinant category, metadata, and harmonization key.
  • If you prefer not to ingest certain fields from the original dataset, either remove them from the data sheet or leave their details blank in the layout sheet. In both cases, they aren't included in the silver data model.
  • Datasets with the same name, published date, and publisher are treated as duplicates.

Use the sample dataset

The sample SDOH datasets provided with healthcare data solutions come prepopulated with all the prerequisite information and are available in your OneLake. You can extract them locally.

Upload datasets to the Fabric workspace

After the datasets are ready, choose one of the following two options to upload them. You can use Option 2 only if you're using the sample dataset provided with SDOH datasets - Transformations (preview).

  • Option 1: Manually upload the datasets.
  • Option 2: Use a script to upload the datasets.

Manually upload the datasets

  1. In your healthcare data solutions environment, select the healthcare#_msft_bronze lakehouse.

  2. Open the Ingest folder. To learn more, see Folder descriptions.

  3. Select the ellipsis (...) beside the folder name and select Upload folder.

  4. Upload the datasets from your local system. Use OneLake file explorer to find the datasets in the following path: <workspace name>\healthcare#.HealthDataManager\DMHSampleData\8SdohPublicDataset.

  5. Refresh the Ingest folder. You should now see the dataset files within the SDOH subfolder.

Use a script to upload the datasets

Important

Use this option only if you're using the provided sample dataset.

  1. Go to your healthcare data solutions Fabric workspace.

  2. Select + New item.

  3. On the New item pane, search and select Notebook.

  4. Copy the following code snippet into the notebook:

    workspace_id = '<workspace_id>' # Workspace ID. Retrieve the value from the healthcare#_msft_config_notebook.
    one_lake_endpoint = "<OneLake_endpoint>" # OneLake endpoint. Retrieve the value from the healthcare#_msft_config_notebook.
    solution_id = "<solution_id>" # Solution ID. Retrieve the value from the healthcare#_msft_config_notebook. 
    bronze_lakehouse_id = "<bronze_lakehouse_id>" # To locate the bronze lakehouse ID, open the bronze lakehouse and check the URL in the browser's address bar: https://{baseurl}/lakehouse/{GUID}/details). The {GUID} value in the URL is the bronze lakehouse ID.
    
    def copy_source_files_and_folders(source_path, destination_path):
       # List the contents of the source directory
       source_contents = mssparkutils.fs.ls(source_path)
    
       # List the contents of the destination directory
       try:
           destination_contents = mssparkutils.fs.ls(destination_path)
           destination_files = {item.path.split('/')[-1]: item.path for item in destination_contents}
       except Exception as e:
           print(f"Destination path {destination_path} does not exist or is empty. Creating the path.")
           destination_files = {}
           mssparkutils.fs.mkdirs(destination_path)
    
       # Copy each item inside the source directory to the destination directory
       for item in source_contents:
           item_path = item.path
           item_name = item_path.split('/')[-1]
           destination_item_path = f"{destination_path}/{item_name}"
    
           if item.isDir:
               # Recursively copy the contents of the directory
               copy_source_files_and_folders(item_path, destination_item_path)
           else:
               if item_name in destination_files:
                   print(f"File already exists, skipping: {destination_item_path}")
               else:
                   print(f"Creating new file: {destination_item_path}")
                   mssparkutils.fs.cp(item_path, destination_item_path, recurse=True)
    
    # Define the source and destination paths with placeholder values
    data_manager_solution_path = f"abfss://{workspace_id}@{one_lake_endpoint}/{solution_id}"
    data_manager_sample_data_path = f"{data_manager_solution_path}/DMHSampleData"
    
    sdoh_csv_data_path = f"{data_manager_sample_data_path}/8SdohPublicDataset/csv"
    sdoh_xlsx_data_path = f"{data_manager_sample_data_path}/8SdohPublicDataset/xlsx"
    
    destination_path_csv = f"abfss://{workspace_id}@{one_lake_endpoint}/{bronze_lakehouse_id}/Files/Ingest/SDOH/CSV"
    destination_path_xlsx = f"abfss://{workspace_id}@{one_lake_endpoint}/{bronze_lakehouse_id}/Files/Ingest/SDOH/XLSX"
    
    # Copy the files along with their parent folders
    copy_source_files_and_folders(sdoh_csv_data_path, destination_path_csv)
    copy_source_files_and_folders(sdoh_xlsx_data_path, destination_path_xlsx)     
    
  5. Run the notebook. The sample SDOH datasets now move to the designated location within the Ingest folder.

The SDOH datasets are now ready for ingestion.