Gzip file uploaded to Azure Blob Storage does not open when downloaded to a Windows machine, and the file size is not the same

Arvind V 1 Reputation point
2025-01-27T13:46:41.1433333+00:00

I am uploading a gzip-encoded CSV file to Azure Blob Storage using C# on .NET 8. I can read the file fine when I download it and read the data using .NET code.

But I have two problems.

  1. When I download the file to my local Windows laptop and try to open it, I get the error "Windows cannot open the file (Archive is invalid)".
  2. The size of the file on Blob Storage is 4.9 KiB, but the downloaded file is 12 KB. Since 1 KiB is 1.024 KB, 4.9 KiB is only about 5.0 KB, so the downloaded file cannot contain the same bytes.

Also, when I try to read this file using Azure Databricks, it does not recognize it as a gzip file (but if I gzip the same data file on Windows and upload it to Blob Storage, it works).

But in Azure Storage Explorer I can open the file and view the data. Below is the code I use for uploading the file.

public async Task SaveAsync
 (
     IEnumerable<MyData> data,
     string containerName, string blobName,
     CancellationToken cancellationToken
 )
 {
     using var ms = new MemoryStream();

     var containerClient = _blobServiceClient.GetBlobContainerClient(containerName);
     await containerClient.CreateIfNotExistsAsync();

     var blobClient = containerClient.GetBlobClient(blobName);

     await using var compress = new GZipStream
     (
         ms,
         CompressionMode.Compress,
         true
     );

     await using var writer = new StreamWriter(compress);

     await using var csv = new CsvWriter
     (
         writer,
         CultureInfo.InvariantCulture,
         true
     );

     csv.Context.RegisterClassMap<MyData>();
     await csv.WriteRecordsAsync(data.OrderBy(x => x.Date), cancellationToken);
     await writer.FlushAsync(cancellationToken);
     
     await ms.FlushAsync(cancellationToken);
     ms.Position = 0;

     var blobHttpHeader = new BlobHttpHeaders
     {
         ContentType = "application/csv",
         ContentEncoding = "gzip",
     };

     IDictionary<string, string> metaData = new Dictionary<string, string>();
     metaData.Add("date", DateTime.UtcNow.ToString(CultureInfo.InvariantCulture));

     await blobClient.UploadAsync
     (
         ms,
         blobHttpHeader,
         metaData,
         null,
         null,
         null,
         default,
         cancellationToken
     );
 }

Tags: .NET, Azure Blob Storage, C#

1 answer

  1. Amira Bedhiafi 28,376 Reputation points
    2025-01-29T08:53:20.1066667+00:00

    You may need to verify that the GZipStream is properly disposed and flushed before you set ms.Position = 0; that is, compress.DisposeAsync() must run before the MemoryStream is used for the upload, because GZipStream only writes the gzip footer when it is closed. I also think the correct Content-Type in the headers should be application/gzip.
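    Something like the following reordering is what I mean; a minimal sketch based on the code in the question, so data, blobClient and cancellationToken come from the surrounding method, and MyDataMap is an assumed name for your ClassMap:

    using var ms = new MemoryStream();

    // Scope the writers so they are disposed *before* the stream is rewound.
    // GZipStream only writes the gzip footer when it is closed, so uploading
    // earlier produces a truncated archive.
    await using (var compress = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
    await using (var writer = new StreamWriter(compress))
    await using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture, leaveOpen: true))
    {
        csv.Context.RegisterClassMap<MyDataMap>(); // assumed name of your ClassMap
        await csv.WriteRecordsAsync(data.OrderBy(x => x.Date), cancellationToken);
    } // csv, writer and compress are flushed and closed here; ms stays open

    ms.Position = 0; // rewind only after the gzip footer has been written

    await blobClient.UploadAsync(ms, new BlobUploadOptions
    {
        HttpHeaders = new BlobHttpHeaders
        {
            // application/gzip describes the stored bytes. Note that keeping
            // ContentEncoding = "gzip" makes browsers decompress the file
            // transparently on download, which can also explain the size difference.
            ContentType = "application/gzip",
        },
        Metadata = new Dictionary<string, string>
        {
            ["date"] = DateTime.UtcNow.ToString(CultureInfo.InvariantCulture)
        }
    }, cancellationToken);

    The key point is only that the dispose happens before ms.Position = 0, not after the method returns as in the original code.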

    For the size, I suspect that if the upload process already compresses the file and Azure Storage applies another layer of encoding on top, the file may effectively be compressed twice, and that is a problem.
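    To see what is actually stored, you can download the raw bytes and check for the gzip magic number; a minimal sketch, where the connection string, container and blob name are placeholders:

    // Placeholder connection string, container and blob name.
    var blobClient = new BlobClient(connectionString, "mycontainer", "myfile.csv.gz");

    // DownloadContentAsync returns the stored bytes without applying any decoding.
    var download = await blobClient.DownloadContentAsync();
    byte[] bytes = download.Value.Content.ToArray();

    // Every gzip file starts with the magic bytes 0x1F 0x8B.
    bool looksLikeGzip = bytes.Length > 2 && bytes[0] == 0x1F && bytes[1] == 0x8B;
    Console.WriteLine($"Stored size: {bytes.Length} bytes, gzip magic present: {looksLikeGzip}");

    If the magic bytes are missing, the stored content itself is broken; if they are present, the problem is on the download path (for example a client transparently decompressing because of the Content-Encoding: gzip header).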

    If Databricks doesn't recognize the file as gzip, I think this is a corruption issue; Spark typically infers the codec from the .gz extension, so a truncated or already-decompressed file will fail to decode. Try to check with:

    df = spark.read.csv("dbfs:/mnt/yourblobstorage/yourfile.csv.gz", header=True, inferSchema=True)
    
    
