Using PowerShell to generate a large test CSV file with random data
I recently posted a new blog that shows how to load a very large CSV file into Excel, breaking the limit of 1 million rows in a single Excel sheet. If you haven't seen it, you should check it out: Loading CSV/text files with more than a million rows into Excel.
One of the challenges for me while writing that blog post was finding sample data to demonstrate that Excel feature. I have a few files that are that big from work, but obviously I could not share those publicly. I also tried to find some sample data on public sites like https://data.gov but I could not find anything with more than 1 million rows.
The only option I had left was creating a sample file myself with some random data. I created a simple PowerShell script to create a file with a few columns filled with random data. You would think that it would be straightforward to do that, but my first version, which wrote one line to the file at a time, would take hours to generate a file with 2 million rows. Not good enough.
I optimized it by creating a thousand rows in memory and writing that batch to the file in one operation. Repeating that 2,000 times gave me a file with two million rows in under 20 minutes. It included 4 columns and about 36 characters per row. I ended up with a file size of 68.8 MB (72,144,471 bytes).
Note that I had to specify the encoding, since the default encoding in a regular PowerShell output would create double-byte characters and would make the resulting file twice as large.
Below is the PowerShell script, both as text and a picture with the color coding from the PowerShell ISE.
$Start = Get-Date $BaseDate = (Get-Date "2016/12/31 12:59:59") $FullYear = 366*24*60*60 $File = ”c:\users\jose\file.csv” "Date,Customer,Type,Value" | Out-File -FilePath $File -Encoding utf8 1..2000 | % { Write-Progress -Activity “Generating File” -PercentComplete ($_/20) $Lines = "" 1..1000 | % { $Dt = $BaseDate.AddSeconds(-(Get-Random $FullYear)) $Ct = (Get-Random 100) if ((Get-Random 5) -lt 4) {$Ty="Sale"} else { $Ty="Return"} $Vl = (Get-Random 100000) / 100 $Lines += [string]$Dt + "," + [string]$Ct + "," + $Ty + "," + [string]$Vl + [char]10 + [char]13 } $Lines | Out-File -FilePath $File -Encoding utf8 -Append } $End = Get-Date "Started at $Start and ended at $End" $Diff = ($End-$Start).TotalSeconds "Processing took $Diff seconds"