Data Factory - Unable to read zip files from Amazon S3. Error: Central Directory corrupt. Unable to read beyond the end of the stream.

Quyen Dang 21 Reputation points
2025-03-03T15:41:25.5666667+00:00

I am using Azure Data Factory's Copy activity to read CSV files contained in zip files in Amazon S3.

Since yesterday I have been having problems reading zip files from Amazon S3. Other file types (at least plain CSV files) are still read without errors.

The files have not changed and none of them is corrupted. Files that were read successfully a few days ago now fail as well, all with the same error: "Central Directory corrupt. Unable to read beyond the end of the stream."

Has there been a recent change to how the Amazon S3 connector in Azure Data Factory reads zip files?

Azure Data Factory

1 answer

  1. Marcoz Zampieri 75 Reputation points
    2025-03-03T15:54:53.2166667+00:00

    Hi Quyen Dang, the error "Central Directory corrupt. Unable to read beyond the end of the stream" means the reader could not parse the central directory, the index stored at the end of a ZIP archive. It typically appears when the ZIP structure is damaged, improperly formatted, or not read completely. The fact that it started suddenly on files that previously worked is notable, and there are several possible causes to consider:

    1. Recent Changes to Azure Data Factory or S3 Connectors

    Azure Data Factory (ADF) is updated continuously, and new versions of the connectors can be deployed. These updates could affect how files are read, including how ZIP archives are extracted.

    • To check whether the S3 connector or ADF itself changed recently:
      • Review the Azure Data Factory release notes: Look for recent updates to ADF or the S3 connector, especially around ZIP file handling. Updates sometimes introduce bugs or changes that are not well documented at first.
      • Check the ADF update history: Visit the Azure updates page to see whether there are reported issues or new versions that could affect ZIP file handling.
    2. Issues with Amazon S3 Zip File Storage

    Although you mentioned that there were no changes to the files themselves, it's possible that something changed in the way S3 is interacting with ADF. Some S3-related issues, such as network latency or temporary corruption during file upload/download, can result in incomplete or corrupt ZIP files.

    • Verify the ZIP files: Manually download the ZIP files from S3 and try extracting them locally using a tool like WinRAR, 7-Zip, or the built-in unzip function. If the files cannot be extracted properly, it could be an issue with the files themselves, even if you haven't seen any changes.
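The local check described above can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module, assuming the archive has already been downloaded from S3 to a local path; the function name `check_zip` is illustrative only.

```python
import zipfile


def check_zip(path):
    """Return (ok, detail) for the ZIP archive stored at `path`."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the name of the
            # first one whose CRC does not match, or None if all are fine.
            bad = zf.testzip()
            if bad is not None:
                return False, f"CRC check failed for member: {bad}"
            return True, f"{len(zf.namelist())} member(s) OK"
    except zipfile.BadZipFile as exc:
        # Raised when the central directory cannot be located or parsed.
        return False, f"not a valid ZIP: {exc}"
```

If this reports the file as valid locally while ADF still fails on it, the archive itself is probably not the problem.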
    3. Try a Different Method of Reading ZIP Files
    • If you are using the Copy activity to extract the ZIP files, test an alternative approach to see whether the issue persists:
      • Use Azure Functions or Data Flows: Instead of relying on the Copy activity, use an Azure Function to extract the files from the ZIP archives in S3, or use Data Flows (if they are compatible) to handle file extraction and reading.
      • Custom Activity: Use a custom activity in ADF to unzip the files programmatically, in C# or Python, for more control over file handling.
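As a rough illustration of the programmatic route, here is a minimal Python sketch that reads the CSV members of a ZIP held in memory. It assumes the archive bytes were already fetched (for example with boto3's `get_object(...)["Body"].read()`, which is outside this sketch), that the CSV files are UTF-8 encoded, and the function name is hypothetical.

```python
import csv
import io
import zipfile


def read_csv_rows_from_zip(zip_bytes):
    """Yield (member_name, row) for every CSV member of an in-memory ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            # Skip anything that is not a CSV file (assumption: by extension).
            if not name.lower().endswith(".csv"):
                continue
            with zf.open(name) as raw:
                # Assumption: members are UTF-8 text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.reader(text):
                    yield name, row
```

Whether this runs in an Azure Function or a custom activity, the point is the same: the whole archive is downloaded first and only then opened, so a partial read fails at download time instead of inside the ZIP parser.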
    4. Logging and Diagnostics
    • Enable Diagnostic Logs: Make sure you have diagnostic logging enabled for the Azure Data Factory pipeline. This can help you gather more detailed information about where and why the failure is occurring.
    • Monitor Amazon S3 logs: Check your S3 access logs to see if there are any errors or issues when ADF attempts to read the ZIP files. If you have S3 logging enabled, it might provide more context.
    5. Test with Simple ZIP Files
    • Test with a simple ZIP file (a known good one) to confirm whether the issue is with all ZIP files or just specific ones. This can help isolate whether the problem is with the file contents or a more systemic issue with ADF/S3.
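A known-good test archive can be generated in a few lines. This is only a sketch: the file name, member name, and sample rows are made up for illustration.

```python
import zipfile


def make_test_zip(path, member="sample.csv"):
    """Write a tiny deflate-compressed ZIP containing one CSV file."""
    rows = "id,name\n1,alpha\n2,beta\n"
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member, rows)
    return path
```

Upload the resulting file to the same S3 bucket and run the same pipeline against it. If even this file fails, the problem is unlikely to be in your original archives.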
    6. File Size Considerations
    • If your ZIP files are particularly large, ADF may run into resource limits or timeouts while reading them from S3. In that case, check the timeout configured on the Copy activity and the resources available to the integration runtime.
    • Split Larger Files: If possible, repackage the data into several smaller ZIP archives and see whether the problem persists with the smaller files.
    7. Check for Network/Timeout Issues
    • Network issues or timeout problems while reading from S3 could also result in incomplete file reads, leading to ZIP extraction errors. Ensure that the network connection between ADF and S3 is stable and that there are no timeouts occurring during the read process.
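An incomplete read usually shows up as a missing end-of-central-directory record, because that record is the last thing in a ZIP file. The sketch below inspects the raw bytes for it, following the standard ZIP layout. It is a diagnostic aid only, not how ADF reads archives, and it does not handle ZIP64 archives.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"  # end-of-central-directory record marker
EOCD_MIN_SIZE = 22              # fixed part of the record, without comment
MAX_COMMENT = 65535             # longest possible trailing archive comment


def inspect_zip_tail(data):
    """Return a short diagnosis of the central directory for ZIP bytes."""
    # The record sits at the end, followed only by an optional comment.
    window_start = max(0, len(data) - (EOCD_MIN_SIZE + MAX_COMMENT))
    pos = data.rfind(EOCD_SIGNATURE, window_start)
    if pos == -1 or pos + EOCD_MIN_SIZE > len(data):
        return "no end-of-central-directory record: file is likely truncated"
    # Offsets 12 and 16 of the record hold the directory size and offset.
    cd_size, cd_offset = struct.unpack_from("<II", data, pos + 12)
    if cd_size == 0xFFFFFFFF or cd_offset == 0xFFFFFFFF:
        return "ZIP64 archive: not checked by this sketch"
    if cd_offset + cd_size > pos:
        return "central directory extends past its end record: sizes disagree"
    return "central directory looks structurally consistent"
```

Comparing the byte count of a downloaded copy with the object size that S3 reports is a simpler first check. This function helps when the sizes match but the archive still fails to open.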
    8. Test in Isolation
    • If other file types (like CSV files) are being read correctly, it might indicate that there is a specific issue with how ADF is handling ZIP files from S3. As a test, try isolating the problem by copying just the problematic ZIP file to another storage (e.g., Azure Blob Storage) and trying to read it from there.
