Gzip file uploaded to Azure Blob Storage does not open when downloaded to a Windows machine, and the file size is not the same

Arvind V 1 Reputation point
2025-01-27T13:46:41.1433333+00:00

I am uploading a gzip-encoded CSV file to Azure Blob Storage using C# on .NET 8. I can read the file fine when I download it and read the data using .NET code.

But I have two problems.

  1. When I download the file to my local Windows laptop and try to open it, I get the error "Windows cannot open the file (Archive is invalid)".
  2. The size of the file on Blob Storage is 4.9 KiB, but the downloaded file is 12 KB. Since 1 KiB is 1.024 KB, 4.9 KiB is only about 5.0 KB, so the downloaded file cannot contain the same bytes.

Also, when I try to read this file using Azure Databricks, it does not recognize it as a gzip file (but if I gzip the same data file on Windows and upload it to Blob Storage, it works).

But in Azure Storage Explorer I can open the file and view the data. Below is the code I use for uploading the file.

public async Task SaveAsync
 (
     IEnumerable<MyData> data,
     string containerName, string blobName,
     CancellationToken cancellationToken
 )
 {
     using var ms = new MemoryStream();

     var containerClient = _blobServiceClient.GetBlobContainerClient(containerName);
     await containerClient.CreateIfNotExistsAsync();

     var blobClient = containerClient.GetBlobClient(blobName);

     await using var compress = new GZipStream
     (
         ms,
         CompressionMode.Compress,
         true
     );

     await using var writer = new StreamWriter(compress);

     await using var csv = new CsvWriter
     (
         writer,
         CultureInfo.InvariantCulture,
         true
     );

     csv.Context.RegisterClassMap<MyData>();
     await csv.WriteRecordsAsync(data.OrderBy(x => x.Date), cancellationToken);
     await writer.FlushAsync(cancellationToken);
     
     await ms.FlushAsync(cancellationToken);
     ms.Position = 0;

     var blobHttpHeader = new BlobHttpHeaders
     {
         ContentType = "application/csv",
         ContentEncoding = "gzip",
     };

     IDictionary<string, string> metaData = new Dictionary<string, string>();
     metaData.Add("date", DateTime.UtcNow.ToString(CultureInfo.InvariantCulture));

     await blobClient.UploadAsync
     (
         ms,
         blobHttpHeader,
         metaData,
         null,
         null,
         null,
         default,
         cancellationToken
     );
 }

Tags: .NET, Azure Blob Storage, C#

1 answer

  1. Amira Bedhiafi 28,376 Reputation points
    2025-01-29T08:53:20.1066667+00:00

    You may need to verify that the GZipStream is properly disposed and flushed before you set ms.Position = 0; that is, compress.DisposeAsync() must run before the MemoryStream is used for the upload, because GZipStream only writes the gzip footer when it is closed. I also think the correct Content-Type in the headers should be application/gzip.
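    Something like the following reordering is what I mean; a minimal sketch based on the code in the question, so data, blobClient and cancellationToken come from the surrounding method, and MyDataMap is an assumed name for your ClassMap:

    using var ms = new MemoryStream();

    // Scope the writers so they are disposed *before* the stream is rewound.
    // GZipStream only writes the gzip footer when it is closed, so uploading
    // earlier produces a truncated archive.
    await using (var compress = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
    await using (var writer = new StreamWriter(compress))
    await using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture, leaveOpen: true))
    {
        csv.Context.RegisterClassMap<MyDataMap>(); // assumed name of your ClassMap
        await csv.WriteRecordsAsync(data.OrderBy(x => x.Date), cancellationToken);
    } // csv, writer and compress are flushed and closed here; ms stays open

    ms.Position = 0; // rewind only after the gzip footer has been written

    await blobClient.UploadAsync(ms, new BlobUploadOptions
    {
        HttpHeaders = new BlobHttpHeaders
        {
            // application/gzip describes the stored bytes. Note that keeping
            // ContentEncoding = "gzip" makes browsers decompress the file
            // transparently on download, which can also explain the size difference.
            ContentType = "application/gzip",
        },
        Metadata = new Dictionary<string, string>
        {
            ["date"] = DateTime.UtcNow.ToString(CultureInfo.InvariantCulture)
        }
    }, cancellationToken);

    The key point is only that the dispose happens before ms.Position = 0, not after the method returns as in the original code.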

    For the size, I suspect that if the upload process already compresses the file and Azure Storage applies another layer of encoding on top, the file may effectively be compressed twice, and that is a problem.
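    To see what is actually stored, you can download the raw bytes and check for the gzip magic number; a minimal sketch, where the connection string, container and blob name are placeholders:

    // Placeholder connection string, container and blob name.
    var blobClient = new BlobClient(connectionString, "mycontainer", "myfile.csv.gz");

    // DownloadContentAsync returns the stored bytes without applying any decoding.
    var download = await blobClient.DownloadContentAsync();
    byte[] bytes = download.Value.Content.ToArray();

    // Every gzip file starts with the magic bytes 0x1F 0x8B.
    bool looksLikeGzip = bytes.Length > 2 && bytes[0] == 0x1F && bytes[1] == 0x8B;
    Console.WriteLine($"Stored size: {bytes.Length} bytes, gzip magic present: {looksLikeGzip}");

    If the magic bytes are missing, the stored content itself is broken; if they are present, the problem is on the download path (for example a client transparently decompressing because of the Content-Encoding: gzip header).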

    If Databricks doesn't recognize the file as gzip, I think this is a corruption issue; Spark typically infers the codec from the .gz extension, so a truncated or already-decompressed file will fail to decode. Try to check with:

    df = spark.read.csv("dbfs:/mnt/yourblobstorage/yourfile.csv.gz", header=True, inferSchema=True)
    
    
