SSIS: Export Multiple File With Fixed Size
Introduction
This article will demonstrate how a “master csv file” can be split into multiple “child files” having approximately matching file size (bytes) and indexing file name using SSIS.
Problem
Exporting a fixed size file will be a difficult task as we cannot compromise file contents at the cost of file size accuracy. Moreover each row size can be different. Even if we read a source file, line by line and copy it to another file we may need to identify row size so that we can compromise row/rows rather than a file size. There can be other alternative of compressing a file such as by using WinZip however the output files will remain as a zip file. We might need to zip and unzip it to and fro.**
**
Solution
The size of each Child file is compared to a predefine size (bytes) and each row is inserted from the Master file until the expected size of the Child file exceeds the threshold or the nearest size.
In this solution we are taking care of data truncation problem and taking care of the size variations in each row.
A log file has been created to capture the information of each Child file. It provides information like file name, size in bytes, total rows and extract date and time for each files.
Open SQL Server Data Tools or BIDS and Create an SSIS project. The Name of the package is “ExtractMultipleFile.dtsx”.
Step 1: Adding SSIS Variable
Add a SSIS variable "ExtractPath" to store the path where we can extract all Child Files. Let us assign a default path say, "C:\TESTING\ExtractPath"
Step 2: Exporting Master CSV File from OLEDB Source
Add a data flow task to extract a Master file(csv) from OLEDB source. The name of the master file is "MasterFile.csv" and it is exported under "C:\TESTING\ExtractPath” i.e “ExtractPath”.
Step 3: Splitting Master File into Child Files
Add a script task to the package and set the ReadOnlyVariable property as “ExtractPath" variable. Refer below list of variables that are also declared to stored Master, Child and Log file path in the script task.
Dim MasterPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\MasterFile.csv" Dim ChildPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\ChildFile.csv" Dim LogPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\log.txt" |
The basic idea is to use System.IO namespace to create, read & write files. Each row of the Master file is read line by line and write the contents to a Child file stream until the split criteria is within the acceptable size. Refer Split File Condition below.
Create Child and Log File
We can create a new Child file by using the below procedure. It has two parameter i.e Path and fs i.e a file stream. Once the file is created it will return a file stream object to the calling portion.
Sub CreateFile(ByVal Path As String, ByRef fs As FileStream) If File.Exists(Path) Then File.Delete(Path) fs = File.Create(Path) End Sub |
Each files has an indexed name to identify the sequence of the split process, for instance ChildFile_1.csv. The variable “fileCounter” identifies the nth file to be created.
Dim childfs As FileStream Dim logfs As FileStream Call CreateFile(Replace(ChildPath, ".csv", "_" & fileCounter & ".csv"), childfs) Call CreateFile(Logpath, logfs) |
Write Child and Log File
An integer variable “filesizeCounter” is used to store and measure the size of a Child File in bytes. Each time the line information from the Master File is written to a Child File stream, the size of the new line is calculated and stored incrementally to the variable “filesizeCounter”.
Sub WriteFile(ByRef fs As FileStream, ByVal LineInfo As String, ByRef filesizeCounter As Integer) Dim info As Byte() = New Text.UTF8Encoding(True).GetBytes(LineInfo & vbNewLine) fs.Write (info, 0, info.Length) ' Add some information to the file. filesizeCounter = filesizeCounter + info.Length End Sub |
Split File Condition
Suppose we want to create Child files of 400 KB each i.e SizeKB = 400. Even if the variable "FilesizeCounter" or the current Child File size exceeds the expected size (SizeKB * 1024 bytes), the recent line read from the Master file will be written to the Child File and then reset the FilesizeCounter to zero. It means that some fractions of bytes can be up and down i.e ~400KB.
A new Child File will be created whenever the file size counter is reset to zero.
fileCounter = fileCounter + 1 Call CreateFile(Replace(ChildPath, ".csv", "_" & fileCounter & ".csv"), childfs) End If |
We can also increase the Child File size to MB by changing the below operand in IF statement.
Call WriteFile(childfs, Parentsr.ReadLine() & vbNewLine, filesizeCounter) If Parentsr.EndOfStream Then childfs.Close() Else Call WriteFile(childfs, Parentsr.ReadLine() & vbNewLine, filesizeCounter) childfs.Close() filesizeCounter = 0 End if |
Complete SSIS Script Code in VB.NET
Imports System.IO
Public Sub Main() Dim FileSize As Integer = 400 'specify in KB. Can be modified. Dim MasterPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\MasterFile.csv" Dim ChildPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\ChildFile.csv" Dim LogPath As String = CStr(Dts.Variables("ExtractPath").Value) & "\log.txt" Try Call SplitFile(MasterPath, ChildPath, LogPath, FileSize) Catch ex As Exception MsgBox(ex.Message) End Try Dts.TaskResult = ScriptResults.Success End SubSub SplitFile(ByVal MasterPath As String, ByVal ChildPath As String, ByVal Logpath As String, ByVal SizeKB As Integer) Dim filesizeCounter As Integer Dim fileCounter As Integer = 0 Dim RowCount As Integer = 0 Dim Parentsr As StreamReader = File.OpenText (MasterPath) Dim childfs As FileStream Dim logfs As FileStream Call CreateFile(Logpath, logfs) 'Create Log File Do While Parentsr.Peek() >= 0 'Looping Master File stream If filesizeCounter = 0 Then fileCounter = fileCounter + 1 Call CreateFile(Replace(ChildPath, ".csv", "_" & fileCounter & ".csv"), childfs) End If If filesizeCounter < (SizeKB * 1024) Then ' 409600/1024=400 KB Call WriteFile(childfs, Parentsr.ReadLine() & vbNewLine, filesizeCounter) If Parentsr.EndOfStream Then childfs.Close() Call WriteFile(logfs, "---------", 0) Call WriteFile(logfs, "File Name:" & _ vbNewLine & "Row Count:" & RowCount & _ vbNewLine & "Size(Bytes):" & filesizeCounter & _ vbNewLine & "Extract End:" & Now().ToString, 0) End If RowCount = RowCount + 1 Else Call WriteFile(childfs, Parentsr.ReadLine() & vbNewLine, filesizeCounter) childfs.Close() ' Close Child File Call WriteFile(logfs, "---------", 0) Call WriteFile(logfs, "File Name:" & vbNewLine & "Row Count:" & RowCount & _ vbNewLine & "Size(Bytes):" & filesizeCounter & _ vbNewLine & "Extract End:" & Now().ToString, 0) RowCount = RowCount + 1 filesizeCounter = 0 ' Reset file size counter End If Loop Parentsr.Close() ' Close Master file logfs.Close() ' Close Log File End Sub Sub CreateFile(ByVal Path As String, ByRef fs As FileStream) If File.Exists(Path) Then File.Delete(Path) 'Delete the if already exist. fs = File.Create(Path) End SubSub WriteFile(ByRef fs As FileStream, ByVal LineInfo As String, ByRef filesizeCounter As Integer) Dim info As Byte() = New Text.UTF8Encoding(True).GetBytes(LineInfo & vbNewLine) fs.Write (info, 0, info.Length) ' Add some information to the file. filesizeCounter = filesizeCounter + info.Length End Sub |
Step 4: Run the package and Output review
Conclusion
The Master CSV File has been splitted into 5 Child files each of ~400 KB except the last file. The number of rows in each Child Files are different however the size of each files is approximately 400 KB.