Parallel Writing of Data to ADLS Delta Storage Causes "The specified path already exists.", 409, PUT Error

Orowole, Ayebogbon-XT 0 Reputation points
2023-11-06T19:54:40.9533333+00:00

I have a Spark application (Spark 3.1.2, Scala 2.12) that reads JSON records from a table. The records are distributed across the cluster executors.

In the application, I use the foreach function to loop through the records in the table. Each record goes through a different set of transformations, and the resulting DataFrame is then written to ADLS storage.

During the foreach loop, it's possible for two executors to write data to the same location in ADLS at the same time (parallel PUT operations). The application works fine in DEV and PRE-PROD. However, once in a while I get the following error in the Log4j output. Sometimes the job succeeds but still logs the error; sometimes the job fails.
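
For illustration, a simplified sketch of the kind of loop described above (the table name, column names, and target-path logic are placeholders, not the actual job code):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("parallel-delta-writes").getOrCreate()
    import spark.implicits._

    // Collect the driving records; each one determines its own target folder.
    val records = spark.table("source_table").collect()

    // Loop (here via a Scala parallel collection) and write each transformed
    // DataFrame to ADLS. Two iterations that resolve to the same target path can
    // try to commit the same _delta_log JSON file at the same time, which is what
    // surfaces as the 409 PathAlreadyExists error.
    records.toSeq.par.foreach { row =>
      val targetPath =
        s"abfss://container@storageAccount.dfs.core.windows.net/parentFolder/${row.getAs[String]("childFolder")}"
      val transformed: DataFrame =
        spark.read.json(Seq(row.getAs[String]("payload")).toDS())
        // ... record-specific transformations go here ...
      transformed.write.format("delta").mode("append").save(targetPath)
    }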

Error

ERROR AbfsClient: HttpRequest: 409,err=PathAlreadyExists,appendpos=,cid=9f83f144-94b8-4108-8e41-4b753eab3575,rid=e46cbae2-101f-0014-6286-104eea000000,connMs=0,sendMs=0,recvMs=48,sent=0,recv=168,method=PUT,url=https://storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90

ERROR AzureBlobFileSystem:V3: FS_OP_RENAME SRC[abfss://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/__tmp_path_dir/.00000000000000007568.json.8b7fd480-6154-443d-8ae9-b1357cec4e7b.tmp] DST[abfss://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json] Rename failed. AbfsRestOperationException: Operation failed: "The specified path already exists.", 409, PUT, https://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90, PathAlreadyExists, "The specified path already exists. RequestId:e46cbae2-101f-0014-6286-104eea000000 Time:2023-11-06T07:57:24.7258168Z"
Operation failed: "The specified path already exists.", 409, PUT, https://container@storageAccount.dfs.core.windows.net/container/parentFolder/childFolder/grandChildFolder/_delta_log/00000000000000007568.json?timeout=90, PathAlreadyExists, "The specified path already exists. RequestId:e46cbae2-101f-0014-6286-104eea000000 Time:2023-11-06T07:57:24.7258168Z"
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:261)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.renamePath(AbfsClient.java:355)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.rename(AzureBlobFileSystemStore.java:766)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.renameWithInstrumentation(AzureBlobFileSystem.java:381)

A similar question was asked at this link, but no solution was provided: https://learn.microsoft.com/en-ie/answers/questions/185752/streaming-upserts-constantly-report-mysterious-log


1 answer

  1. Vidhya Sagar Karthikeyan 396 Reputation points
    2025-01-08T14:00:23.99+00:00

    @Orowole, Ayebogbon-XT -- I understand the problem. By design and by default, the Delta framework uses optimistic concurrency control when writing data back to the table. However, the error you are facing is not about the table data itself; it happens when the Delta framework writes the underlying transaction log file, which may already have been created by another thread.
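
    To see what is actually colliding, you can list the tail of the _delta_log folder: each commit is just the next zero-padded JSON file, and two writers that both last read version N will both try to create N+1. A small sketch, assuming `spark` is the active SparkSession and the path is adjusted to your table:

        import org.apache.hadoop.fs.Path

        // Hypothetical path; point this at the _delta_log folder from your error message.
        val logPath = new Path("abfss://container@storageAccount.dfs.core.windows.net/parentFolder/childFolder/grandChildFolder/_delta_log")
        val fs = logPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

        // The last few commit files; a conflicting writer fails while PUT-ting the
        // next number in this sequence (00000000000000007568.json in your log).
        fs.listStatus(logPath)
          .map(_.getPath.getName)
          .filter(_.endsWith(".json"))
          .sorted
          .takeRight(3)
          .foreach(println)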

    • I don't think there is a straightforward solution to it.
    • I believe this is related to optimized write. This feature lets the Delta framework write fewer, larger files instead of many small ones. By default it is off, which means every time a thread writes something it has to invalidate a file (if it's an update) and create a new data file, which then gets recorded in the next transaction log file. In your case, two threads appear to be trying to create the same numbered log file at the same time. You can turn the feature on with the delta.autoOptimize.optimizeWrite table property, or as a writer option; see the sketch after this list.
    • This might also be caused by the autoCompact feature of the Delta framework, which automatically compacts many small files into fewer, larger ones to improve performance. You can experiment with this feature and see if it helps. If you turn it off, you need to maintain the table yourself by manually compacting and vacuuming.
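
    A hedged sketch of enabling both features, assuming a Databricks runtime, `spark` as the active SparkSession, and the table path from the question; the property and config names below are the standard Delta/Databricks ones, but verify them against your runtime's documentation:

        // Per-table: set the Delta table properties on the existing (path-based) table.
        spark.sql("""
          ALTER TABLE delta.`abfss://container@storageAccount.dfs.core.windows.net/parentFolder/childFolder/grandChildFolder`
          SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact'   = 'true'
          )
        """)

        // Per-session: turn the same behaviour on for every write in this Spark session.
        spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
        spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")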

    Regarding the Parquet format: this won't happen if you write plain Parquet. However, you will lose all the nice features that come with Delta tables: schema evolution, statistics, SQL-based queries, change data feed, etc.

    I don't have much knowledge of the data set you are dealing with. What you could also do is land the data as Parquet and then have a separate process load the Parquet into a Delta table (which might be your silver layer).
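
    A minimal sketch of that two-step approach; the paths are hypothetical and `transformedDf` stands for the per-record result of your transformations:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().getOrCreate()

        // Hypothetical landing and target locations.
        val landingPath = "abfss://container@storageAccount.dfs.core.windows.net/landing/grandChildFolder"
        val silverPath  = "abfss://container@storageAccount.dfs.core.windows.net/silver/grandChildFolder"

        // Step 1 (inside the existing per-record loop): land the result as plain
        // Parquet. There is no shared _delta_log to contend on, so concurrent
        // iterations only add new part files.
        // transformedDf.write.mode("append").parquet(landingPath)

        // Step 2 (a separate, serial job): fold the landed Parquet into the Delta
        // table, e.g. the silver layer.
        spark.read.parquet(landingPath)
          .write.format("delta").mode("append").save(silverPath)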

    Mark as answer if this helps!

