Azure Synapse pipeline

Lotus88 46 Reputation points
2025-01-17T09:16:28.69+00:00

Hi,

I have a question regarding Synapse. I want to create a pipeline that performs a delta update of the target table “quotes_target” from the source table “quotes_source” using filtered quotes data. After extracting from the CDC tables, I have a list of quote IDs for the updated quotes, which I have written to a Parquet file. I want to use a Copy activity to copy the updated quotes from the “quotes_source” table to the “quotes_target” table. However, I am stuck on how to filter the source data by the list of quote IDs in the Parquet file.

Can anyone help? Thank you!

Azure Synapse Analytics

2 answers

  1. Vinodh247 27,871 Reputation points MVP
    2025-01-17T17:02:03.7766667+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    To filter the source data by the list of quote IDs in your Parquet file and copy only the updated quotes from the quotes_source table to the quotes_target table in Synapse, you can follow the steps below:

    Step 1: Load the Parquet File to a Staging Table

    1. Create a staging table in your database to temporarily hold the quote IDs from the Parquet file, for example: CREATE TABLE quotes_id_staging (quote_id INT);
    2. Use a Copy activity in your Synapse pipeline to load the Parquet file into this quotes_id_staging table. Configure the source dataset as the Parquet file and the sink dataset as the staging table.
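
    Alternatively, if the Parquet file already sits in an ADLS Gen2 container, the dedicated SQL pool's COPY statement can load it directly, without a separate Copy activity. A minimal sketch, assuming a hypothetical storage path and that the pool's managed identity has read access to the container:

    -- Load the quote IDs straight from the Parquet file into the staging table.
    -- The URL is a placeholder; substitute your own account, container, and path.
    COPY INTO quotes_id_staging
    FROM 'https://<storage-account>.dfs.core.windows.net/<container>/updated_quote_ids.parquet'
    WITH (
        FILE_TYPE = 'PARQUET',
        CREDENTIAL = (IDENTITY = 'Managed Identity')
    );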

    Step 2: Filter Data from the Source Table

    Once the quotes_id_staging table is populated with the list of quote_ids, use a Mapping Data Flow or Stored Procedure to filter the quotes_source table based on the list of quote_ids.

    Option 1: Use a Mapping Data Flow

    1. Create a Data Flow in your Synapse pipeline.
    2. Add a source transformation to read data from the quotes_source table.
    3. Add another source transformation to read data from the quotes_id_staging table.
    4. Use a Join transformation:
      • Join the quotes_source table and the quotes_id_staging table on the quote_id column.
      • Select the Inner Join type to filter only matching quote_ids.
    5. Add a sink transformation to write the filtered data to the quotes_target table.
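
    As a sanity check before running the flow, the inner join it performs is equivalent to the following query, so the number of rows it returns is the number of rows the data flow should write (a sketch, assuming quote_id is the key column in both tables):

    SELECT qs.*
    FROM quotes_source AS qs
    INNER JOIN quotes_id_staging AS qids
        ON qs.quote_id = qids.quote_id;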

    Option 2: Use a stored procedure

    You can write a SQL query to perform the filtering and updating in the database using a MERGE statement. Add a Stored Procedure activity to the pipeline to execute this SQL. Here's an example:

    MERGE INTO quotes_target AS target
    USING (
        SELECT qs.*
        FROM quotes_source qs
        INNER JOIN quotes_id_staging qids ON qs.quote_id = qids.quote_id
    ) AS source
    ON target.quote_id = source.quote_id
    WHEN MATCHED THEN
        UPDATE SET 
            target.column1 = source.column1,
            target.column2 = source.column2,
            -- Add all columns you want to update
            target.updated_at = GETDATE()
    WHEN NOT MATCHED THEN
        INSERT (quote_id, column1, column2, created_at)
        VALUES (source.quote_id, source.column1, source.column2, GETDATE());
    
    
    

    Optionally, truncate the quotes_id_staging table after the pipeline run to keep it clean for future updates; a Script activity at the end of the pipeline is enough, as sketched below. This approach ensures scalability and flexibility while leveraging the capabilities of Synapse for incremental updates.
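    For example, as the only statement in that Script activity:

    -- Reset the staging table so the next CDC batch starts from an empty list
    TRUNCATE TABLE quotes_id_staging;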

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.


  2. Chandra Boorla 8,230 Reputation points Microsoft Vendor
    2025-02-12T10:46:27.5+00:00

    @Lotus88

    It looks like you're passing the p_extract_end_dt parameter from the pipeline to the data flow, but the value might not be properly applied to the sys_update_dt field during the upsert process in the sink.

    Understanding the Upsert Behavior in Data Flow

    When using upsert in an Azure Data Factory (ADF) Data Flow, the update behavior depends on whether the column is explicitly mapped and modified during the transformation.

    Here are a few possible causes and solutions:

    Ensure Parameter Mapping in Data Flow - Inside the data flow, you need to explicitly set sys_update_dt to the value of p_extract_end_dt in a Derived Column transformation before the sink. If you didn’t do this, sys_update_dt might retain its original value from the source instead of being updated.

    Check Sink Mapping - Open the Sink transformation and verify that sys_update_dt is mapped correctly. If it’s missing, manually map it to the correct column.

    Upsert Behavior - In an upsert operation, only columns that are mapped and modified in the flow will be updated in the target table. If sys_update_dt is not explicitly modified, it may retain its existing value.

    Check for "Alter Row" Transformation (if any) - If you're using an Alter Row transformation to control updates, ensure it allows the sys_update_dt column to be modified.

    Expected Behavior

    If mapped correctly, sys_update_dt should be updated to 2025-02-12 16:09:09.000. If not mapped, sys_update_dt may retain its previous value or not be updated.
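
    To confirm which behavior you are getting, you can spot-check the sink after a debug run. A minimal sketch, assuming the sink is the quotes_target table keyed by quote_id (both names are illustrative) and using the run timestamp mentioned above:

    -- Rows touched by the run should carry the parameter value;
    -- rows still showing an older sys_update_dt were not updated by the sink.
    SELECT quote_id, sys_update_dt
    FROM quotes_target
    WHERE sys_update_dt = '2025-02-12 16:09:09.000';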

    I hope this information helps.

    Thank you.

