Prepare the public datasets in SDOH datasets - Transformations

Article
2025-03-31

SDOH public datasets contain aggregated social determinants of health (SDOH) data published by government agencies and other official sources such as universities. These datasets consolidate various SDOH parameters at geographic levels such as state, county, or zip code. The SDOH datasets - Transformations capability lets you ingest these geography-level datasets in CSV (Comma-separated values), XLSX (Excel Open XML Spreadsheet), or XLS (Microsoft Excel spreadsheet) formats and normalize them into a custom data model.

The current release provides the following eight sample SDOH datasets from various SDOH domains to help you run data pipelines and explore data transformations through the bronze, silver, and gold lakehouse layers:

USDA's Food Environment Atlas: Includes factors such as store/restaurant proximity, food prices, nutrition assistance programs, and community characteristics. These factors affect food choices, diet quality, and ultimately, health outcomes.
USDA's Rural Atlas: Offers statistics on socio-economic factors such as people, jobs, county classifications, income, and veterans.
AHRQ's SDOH Data: Provides details across five key SDOH domains:
- Social context, such as age, race/ethnicity, veteran status.
- Economic context, such as income, unemployment rate.
- Education
- Physical infrastructure, such as housing, crime, transportation.
- Healthcare context, such as health insurance.
Location Affordability Index: Estimates household housing and transportation costs at the neighborhood level.
Environmental Justice Index: Aggregates data from multiple sources to rank the cumulative impacts of environmental injustice on health for every census tract.
ACS Education Attainment: Provides education insights for geographic areas, derived from a large, ongoing demographic survey.
Australian SEIFA: Combines Australian census data such as income, education, employment, and housing to summarize an area's socio-economic characteristics.
U.K. Indices of Deprivation: A widely used socio-economic measure within the United Kingdom to assess poverty across small areas, covering various dimensions.

Where:

USDA: United States Department of Agriculture
AHRQ: Agency for Healthcare Research and Quality
ACS: American Community Survey
SEIFA: Socio-Economic Indexes for Areas

Important

These datasets aren't just samples but complete, real datasets published by the respective organizations. They provide an accurate representation of the SDOH profiles of their geographic areas. Be careful when modifying them, as they are official publications from federal agencies.

Folder structure

The landing zone for SDOH datasets - Transformations comprises three folders: Ingest, Process, and Failed. To learn more about these folders, see Unified folder structure.

Prepare the SDOH datasets before ingestion

Before ingesting SDOH public datasets, ensure they're ready for successful ingestion. The following sections outline two scenarios:

Use your own dataset
Use the sample dataset

Use your own dataset

SDOH public datasets vary significantly across publishing organizations in format, volume, and structure. They lack an established standard for collecting and exchanging the captured information. Therefore, unifying them into a common shape is essential before representing them within a data model.

To ingest and transform an SDOH public dataset of your choice, add the following three key pieces of information to them:

Layout: Due to the absence of a standard set of codes for capturing SDOH data, understanding the meaning of each field is challenging. To resolve this issue, create a data dictionary for the dataset by adding a new sheet named Layout (if your dataset is in XLSX or XLS format) or create a new CSV file (if your dataset is in CSV format) with the columns displayed in the following example:

Column	Description
SocialDeterminantName	Contains the column names representing social determinant values in the raw dataset.
Category	Specify the desired category of the social determinant, such as Housing or Access. These values are often reflected in the sheet names of the raw dataset.
SubCategory	Provide the subcategory of the social determinant, like Household Resources or Food, representing access to household resources and food.
SocialDeterminantDescription	Describe the meaning of the social determinant. Since column names in the raw dataset may use short codes, this field clarifies their meaning.
Units	Specify the unit of the social determinant.
HarmonizationKey	Add an annotation or tag representing the category of each SDOH metric based on your business logic. Unlike Category, which is often determined by the publisher, the harmonization key allows you to group social determinants. For example, you might assign Demographics as a harmonization key for Total weighted population, even if the publisher categorizes it as Supplemental data-state.

DataSetMetadata: As SDOH datasets come from different publishers, recording key details about the dataset is crucial. Add a new sheet named DataSetMetadata (if your dataset is in XLSX format) or create a new CSV file (if your dataset is in CSV format) with the columns displayed in the following example:

Column	Description
DataSetName	Name of the original dataset.
PublisherName	Name of the dataset publisher.
PublishedDate	Date the dataset was published. If the exact day or month is unavailable, use 1st or January as the values.
ValidUntil	Expiration date of the dataset. When publishers release a new version, the previous version usually becomes invalid. Some publishers may explicitly specify the dataset's validity period.

LocationConfiguration: Different geographies define and organize location data in various ways. To help the SDOH pipelines understand your dataset's geographical structure, add a new sheet named LocationConfiguration (if your dataset is in XLSX or XLS format) or create a new CSV file (if your dataset is in CSV format) with the columns displayed in the following example:

Specify all the location-related columns in this sheet to retrieve them later in the lakehouses. This way, you can aggregate the social determinant information at various location levels as appropriate. Use this guidance to understand the type of information expected in each of the columns:

Column	Description
ColumnName	Name of the location-related column as depicted in the raw dataset.
StandardColumnName	The standard name for the location you wish to use in the silver lakehouse. It's important to have this standardization as different datasets use different column names to depict the same information. For example, Dataset A uses `FIPS` to depict County FIPS and Dataset B uses `FIPSCD` to depict County FIPS. Assigning the same value, such as COUNTYFIPS, for StandardColumnName to both fields lets you retrieve the social determinant information from these disjoint datasets based on the common attribute County FIPS.
AssociatedWithSDOHValue	A flag to understand whether the social determinant value is calculated at that location-level. There's always one row with the flag value set to `TRUE`.

Also:

You can refer to the structure of the sample SDOH datasets to populate requisite information such as social determinant category, metadata, and harmonization key.
If you prefer not to ingest certain fields from the original dataset, either remove them from the data sheet or leave their details blank in the layout sheet. In both cases, they aren't included in the silver data model.
Datasets with the same name, published date, and publisher are treated as duplicates.

Use the sample dataset

The sample SDOH datasets provided with healthcare data solutions come prepopulated with all the prerequisite information and are available in your OneLake. You can extract them locally.

Upload datasets to the Fabric workspace

After the datasets are ready, choose one of the following two options to upload them. You can use Option 2 only if you're using the sample dataset provided with SDOH datasets - Transformations.

Option 1: Manually upload the datasets.
Option 2: Use a script to upload the datasets.

Manually upload the datasets

In your healthcare data solutions environment, select the healthcare#_msft_bronze lakehouse.
Open the Ingest folder. To learn more, see Folder descriptions.
Select the ellipsis (...) beside the folder name and select Upload folder.
Upload the datasets from your local system. Use OneLake file explorer to find the datasets in the following path: <workspace name>\healthcare#.HealthDataManager\DMHSampleData\8SDOHPublicDatasets.
Refresh the Ingest folder. You should now see the dataset files within the SDOH subfolder.

Use a script to upload the datasets

Important

Use this option only if you're using the provided sample dataset.

Go to your healthcare data solutions Fabric workspace.
Select + New item.
On the New item pane, search and select Notebook.

Copy the following code snippet into the notebook:


# Workspace Id (Retrieve the value from the healthcare#_msft_config_notebook)
workspace_id = "<workspace_id>"

# One lake endpoint (Retrieve the value from the healthcare#_msft_config_notebook)
one_lake_endpoint = "<OneLake_endpoint>"

# Bronze lakehouse ID
# To locate the bronze lakehouse ID, open the bronze lakehouse and check the URL in the browser's address bar: https://{baseurl}/lakehouse/{GUID}/details). The {GUID} value in the URL is the bronze lakehouse ID.
bronze_lakehouse_id = "<bronze_lakehouse_id>"

def copy_source_files_and_folders(source_path, destination_path):
    # List the contents of the source directory
    source_contents = mssparkutils.fs.ls(source_path)
    # List the contents of the destination directory
    try:
        destination_contents = mssparkutils.fs.ls(destination_path)
        destination_files = {item.path.split('/')[-1]: item.path for item in destination_contents}
    except Exception as e:
        print(f"Destination path {destination_path} does not exist or is empty. Creating the path.")
        destination_files = {}
        mssparkutils.fs.mkdirs(destination_path)
    # Copy each item inside the source directory to the destination directory
    for item in source_contents:
        item_path = item.path
        item_name = item_path.split('/')[-1]
        destination_item_path = f"{destination_path}/{item_name}"
        if item.isDir:
           # Recursively copy the contents of the directory
           copy_source_files_and_folders(item_path, destination_item_path)
        else:
           if item_name in destination_files:
              print(f"File already exists, skipping: {destination_item_path}")
           else:
              print(f"Creating new file: {destination_item_path}")
              mssparkutils.fs.cp(item_path, destination_item_path, recurse=True)

# Define the source and destination paths with placeholder values
sdoh_csv_data_path = f"abfss://{workspace_id}@{one_lake_endpoint}/{bronze_lakehouse_id}/Files/SampleData/SDOH/CSV"
sdoh_xlsx_data_path = f"abfss://{workspace_id}@{one_lake_endpoint}/{bronze_lakehouse_id}/Files/SampleData/SDOH/XLSX"
destination_path_csv = f"abfss://{workspace_id}@{one_lake_endpoint}/{bronze_lakehouse_id}/Files/Ingest/SDOH/CSV"
destination_path_xlsx = f"abfss://{workspace_id}@{one_lake_endpoint}/{bronze_lakehouse_id}/Files/Ingest/SDOH/XLSX"

# Copy the files along with their parent folders
copy_source_files_and_folders(sdoh_csv_data_path, destination_path_csv)
copy_source_files_and_folders(sdoh_xlsx_data_path, destination_path_xlsx)

Run the notebook. The sample SDOH datasets now move to the designated location within the Ingest folder.

The SDOH datasets are now ready for ingestion.

Share via