C#: Processing CSV files (Part 1A)
Introduction
This article is a continuation of a series on parsing delimited text file leading off with part 1 of the series. This article will discuss prevalidation ideas rather than performing all parsing and validation together.
What should be considered when parsing delimited files that the unexpected should always be considered, never think too much assertion has been performed.
Base data
Throughout this article the file is delimited by a comma as per below where each line in the file is consider a customer.
Alana,Petty,AF7Y7VM47Z7J2TMF2BXKRWV8PPJ,Fremont,Missouri,Andorra,5442.20,1956-08-31 17:01:39.3333997
Which can be represented in a class for a matching table in a SQL-Server database table.
public class Customer
{
public int Id { get; set; }
public string FirstName { get; set; }
public string LastName { get; set; }
public string Street { get; set; }
public string City { get; set; }
public string State { get; set; }
public string Country { get; set; }
public decimal? Balance { get; set; }
public DateTime? LastPaid { get; set; }
public override string ToString()
{
return $"{FirstName}, {LastName}";
}
}
Perfect world parsing
When information in a delimited text file "always" perfected, there may be empty lines while all non-empty lines the data is correct data types a file may be parses in one line passing a file name to a method which in turn returns a list of (in this case) a Customer.
public List<Customer> ReadCustomersPerfectWorld(string fileName)
{
return (
from line in File.ReadAllLines(fileName)
where line.Length > 0
let lineParts = line.Split(',')
select new Customer()
{
FirstName = lineParts[0],
LastName = lineParts[1],
Street = lineParts[2],
City = lineParts[3],
State = lineParts[4],
Country = lineParts[5],
Balance = Convert.ToDecimal(lineParts[6]),
LastPaid = Convert.ToDateTime(lineParts[7])
}).ToList();
}
Imperfect world parsing
When there is imperfection a developer may enclose read logic in a try/catch statement.
public List<Customer> ReadCustomersImperfectWorld(string fileName)
{
var customerList = new List<Customer>();
try
{
customerList = (
from line in File.ReadAllLines(fileName)
where line.Length > 0
let lineParts = line.Split(',')
select new Customer()
{
FirstName = lineParts[0],
LastName = lineParts[1],
Street = lineParts[2],
City = lineParts[3],
State = lineParts[4],
Country = lineParts[5],
Balance = Convert.ToDecimal(lineParts[6]),
LastPaid = Convert.ToDateTime(lineParts[7])
}).ToList();
}
catch (Exception ex)
{
mHasException = true;
mLastException = ex;
}
return customerList;
}
Although this might seem okay there is no way to deal with issues like invalid data types and may only resolve a missing file or no permissions to read a file. A missing file should be asserted prior to calling this method while insufficient permissions are to be handled outside of the application usually by moving the file to were permissions are granted or to have the user permissions set in Active Directory.
Using assertion parsing/processing
There are third party libraries for handling validation and processing of delimited files while many time no matter how good they are a developer must spend time learning them, instead consider writing in house classes for validation.
If the validation is done through a user interface the process should always present both good and invalidated data where in this case good deal is shown in a window which started processing an a child window for invalid data which permits the user to identify what actions to take to correct the issues. In a service which runs without a user interface write to a log file.
In the following section valid data is displayed in a grid and invalid data in a secondary window. Although this is done in a desktop application the same applies in web applications where a link would be present to view incorrect data.
The following class performs all validation, stored valid customer data in a list and issues in a separate list. Step one is to validate after splitting a line that the array has a specific column count. If not do not continue as the assertions to follow rely on a field count.
var parts = customerLine.Split(',');
const int fieldCount = 8;
if (parts.Length != fieldCount)
{
IssueList.Add($"Line: {lineIndex} Field count {parts.Length}, expected {fieldCount}.");
return;
}
Business rules require first and last name, elements (columns) index 0 and index 1. This is done using string.IsNullOrWhiteSpace.
if (!string.IsNullOrWhiteSpace(parts[0]))
{
Customer.FirstName = parts[0];
}
else
{
IssueList.Add($"Line: {lineIndex} FirstName required");
}
if (!string.IsNullOrWhiteSpace(parts[1]))
{
Customer.LastName = parts[1];
}
else
{
IssueList.Add($"Line: {lineIndex} LastName required");
}
Address columns in indexes 1,3,4 and 5 are optional.
Customer.Street = parts[2] ?? "";
Customer.City = parts[3] ?? "";
Customer.State = parts[4] ?? "";
Customer.Country = parts[5] ?? "";
In column 6, a numeric decimal is expected. Here assertion is performed using decimal.TryParse using C# 7 our var.
if (!string.IsNullOrWhiteSpace(parts[6]))
{
if (decimal.TryParse(parts[6], out var balanceResult))
{
Customer.Balance = balanceResult;
}
else
{
IssueList.Add($"Line: {lineIndex} Balance value '{parts[6]}' is invalid decimal");
}
}
else
{
IssueList.Add($"Line: {lineIndex} Balance is required");
}
Column 7 is a datetime, similar to the last column TryParse is used against a date.
if (!string.IsNullOrWhiteSpace(parts[7]))
{
if (DateTime.TryParse(parts[7], out var paidResult))
{
Customer.LastPaid = paidResult;
}
else
{
IssueList.Add($"Line: {lineIndex} Last paid '{parts[7]}' is not valid date");
}
}
else
{
IssueList.Add($"Line: {lineIndex} Last paid is required");
}
Finally a constructed Customer and issue list are set.
CustomerResults.Customers.Add(Customer);
CustomerResults.IssuesList.AddRange(IssueList);
Entire class for processing a line in a text file.
public class CustomerCreator
{
public Customer Customer { get; set; }
public CustomerResults CustomerResults { get; set; }
public List<string> IssueList { get; set; }
public bool HasIssues => IssueList.Count > 0;
public string Issues => string.Join(",", IssueList.ToArray());
public CustomerCreator(string customerLine, int lineIndex)
{
IssueList = new List<string>();
Customer = new Customer();
CustomerResults = new CustomerResults();
Build(customerLine, lineIndex);
}
private void Build(string customerLine, int lineIndex)
{
var parts = customerLine.Split(',');
const int fieldCount = 8;
if (parts.Length != fieldCount)
{
IssueList.Add($"Line: {lineIndex} Field count {parts.Length}, expected {fieldCount}.");
return;
}
if (!string.IsNullOrWhiteSpace(parts[0]))
{
Customer.FirstName = parts[0];
}
else
{
IssueList.Add($"Line: {lineIndex} FirstName required");
}
if (!string.IsNullOrWhiteSpace(parts[1]))
{
Customer.LastName = parts[1];
}
else
{
IssueList.Add($"Line: {lineIndex} LastName required");
}
Customer.Street = parts[2] ?? "";
Customer.City = parts[3] ?? "";
Customer.State = parts[4] ?? "";
Customer.Country = parts[5] ?? "";
if (!string.IsNullOrWhiteSpace(parts[6]))
{
if (decimal.TryParse(parts[6], out var balanceResult))
{
Customer.Balance = balanceResult;
}
else
{
IssueList.Add($"Line: {lineIndex} Balance value '{parts[6]}' is invalid decimal");
}
}
else
{
IssueList.Add($"Line: {lineIndex} Balance is required");
}
if (!string.IsNullOrWhiteSpace(parts[7]))
{
if (DateTime.TryParse(parts[7], out var paidResult))
{
Customer.LastPaid = paidResult;
}
else
{
IssueList.Add($"Line: {lineIndex} Last paid '{parts[7]}' is not valid date");
}
}
else
{
IssueList.Add($"Line: {lineIndex} Last paid is required");
}
CustomerResults.Customers.Add(Customer);
CustomerResults.IssuesList.AddRange(IssueList);
}
}
The above is called from a method responsible for iterating a file using a stream reader which in this case is called from a button click event in a window. Since there is nothing indicating the window type the project type may be windows forms or WPF.
The return type is a ValueTuple which has been installed using NuGet. Note the last three line prior to return statement would be for a developer to see if there were variations on the total line count, how many empty lines for instance.
public (List<Customer> customers, List<string> issues) ReadCustomersPerfectWorldCreator(string fileName)
{
var customerList = new List<Customer>();
var issueList = new List<string>();
var counterValidLines = 0;
var counterEmptyLines = 0;
var currentLineIndex = 1;
string line;
var file = new StreamReader(fileName);
try
{
while ((line = file.ReadLine()) != null)
{
if (!string.IsNullOrWhiteSpace(line))
{
counterValidLines++;
var result = new CustomerCreator(line, currentLineIndex);
if (result.HasIssues)
{
issueList.AddRange(result.IssueList);
}
else
{
customerList.Add(result.Customer);
}
}
else
{
counterEmptyLines++;
}
currentLineIndex++;
}
}
catch (Exception exception)
{
mHasException = true;
mLastException = exception;
}
finally
{
file.Close();
}
Console.WriteLine("There were {0} valid lines.", counterValidLines);
Console.WriteLine("There were {0} empty lines.", counterEmptyLines);
Console.WriteLine(counterEmptyLines + counterValidLines);
return (customerList, issueList);
}
In the calling button click event FileOperations class is created, the file is identified then a call is made to the method above. The first assertion is to check if there were unexpected issues, runtime exceptions followed by checking to see if the issue list returned by the method ReadCustomersPerfectWprldCreator has issues.
If there is good data show this data and prompt to view invalid data. If the next step were to import data the imperfect data would be excluded as all lines with issues has been excluded from the good data list.
private void RealWorldGoodButton_Click(object sender, EventArgs e)
{
var fileOperations = new FileOperations();
var fileName = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Import1.txt");
var (customers, issueList) = fileOperations.ReadCustomersPerfectWorldCreator(fileName);
if (fileOperations.IsSuccessFul)
{
_bindingSource.DataSource = customers;
dataGridView1.DataSource = _bindingSource;
}
else
{
MessageBox.Show(fileOperations.LastExceptionMessage);
}
if (issueList.Count <= 0) return;
if (Question("There were imperfect data, review?"))
{
var issuesForm = new IssuesForm(issueList);
issuesForm.Text = "Encountered these issues";
try
{
issuesForm.ShowDialog();
}
finally
{
issuesForm.Dispose();
}
}
}
Preprocessing data
Suppose there may be no empty lines, using a method below would assert there are no empty lines. If the int list has elements then there are one or more lines which are empty, the int list hold line numbers which are empty in the event this needs to be pointed out to the entity supplying the file.
Empty lines
/// <summary>
/// For validating each line in a file has information
/// </summary>
/// <param name="fileName"></param>
/// <returns></returns>
public List<int> EmptyLines(string fileName)
{
var lineList = new List<int>();
var line = "";
int lineIndex = 1;
using (var reader = new StreamReader(fileName))
{
while (!reader.EndOfStream)
{
line = reader.ReadLine();
if (string.IsNullOrWhiteSpace(line))
{
lineList.Add(lineIndex);
}
lineIndex++;
}
}
return lineList;
}
Field count
The following method checks to ensure there are the expected field count for each line which would run after the EmptyLines method above. If the returning list has one or more elements they represent the line number in the file which can be used to either manually inspect or report to the entity supplying the file.
/// <summary>
/// Validate each line has the proper field count
/// </summary>
/// <param name="fileName"></param>
/// <returns></returns>
public List<int> HasValidFieldCount(string fileName)
{
var lineList = new List<int>();
var line = "";
int lineIndex = 1;
const int fieldCount = 8;
using (var reader = new StreamReader(fileName))
{
while (!reader.EndOfStream)
{
line = reader.ReadLine();
if (line != null && line.Split(',').Length != fieldCount)
{
lineList.Add(lineIndex);
}
lineIndex++;
}
}
return lineList;
}
Checking data types
When a value is any other type than a string there needs to be assertion to ensure the value can be stored. In the example below the last element must be a valid date. This would be called after the last two methods.
/// <summary>
/// Used for validating each line has a valid date field
/// </summary>
/// <param name="fileName"></param>
/// <returns></returns>
public List<string> ValidatePaidDate(string fileName)
{
const int fieldCount = 8;
var issues = new List<string>();
var line = "";
int lineIndex = 1;
using (var reader = new StreamReader(fileName))
{
while (!reader.EndOfStream)
{
line = reader.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
{
var parts = line.ProcessSpacings().Split(',');
if (parts.Length != fieldCount)
{
issues.Add($"Line {lineIndex} has no data");
}
if (!string.IsNullOrWhiteSpace(parts[7]))
{
if (!DateTime.TryParse(parts[7], out var _ ))
{
issues.Add($"Line {lineIndex} '{parts[7]}' is not a valid date");
}
}
}
lineIndex++;
}
}
return issues;
}
Dealing with unwanted spaces
Rather than
Alana,Petty,AF7Y7VM47Z7J2TMF2BXKRWV8PPJ,Fremont,Missouri,Andorra,5442.20,1956-08-31 17:01:39.3333997
Each line has unwanted spaces.
Alana, Petty, AF7Y7VM47Z7J , Fremont,Missouri,Andorra
Patricia, Cooke,B60BIJLS36KKU,Louisville, Utah ,China
Jami,Browning,UY99GBG0YE9LRCQP1XYG889B7L1XU78, Indianapolis, Tennessee,Madagascar
This needs to be considered and is easy to fix. The proper way to know is when it's first encountered. To fix this read each line in the file, call ProcessSpaces (shown below) extension and write to a StringBuilder which save the results to the same or different file name.
using System;
using System.Linq;
namespace Example1Library.Extensions
{
public static class StringExtension
{
public static string ProcessSpacings(this string line)
{
try
{
return string.Join(",", Array.ConvertAll(line.Split(','),
field => field.Trim()).Select(items => $"{items}"));
}
catch (Exception)
{
return line;
}
}
}
}
Example usage while iterating lines via a StreamReader.
while (!reader.EndOfStream)
{
line = reader.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
{
var parts = line.ProcessSpacings().Split(',');
Uneven columns
In some cases one or more lines may have too much or too little data, for example, in the following file the first three columns are to be imported into a DataGridView with three defined columns and the first line is the column/field names.
The following code assertions
- Does the file exists?
- No, exit method with an empty array, in an actual application the file exists should be done by the caller of the method.
- Yes, continue to read lines in the file
- Checks for empty lines, if present ignore.
- Each line
- On splitting a line
- If length of the line after split is under three or exactly three elements add the array to the list.
- If the length of the line is greater than expected take only the first three elements
- On splitting a line
Note more assertion would be required if one or more elements need to be converted to another type other than string which means returning an object array or a list of T for a class which represents the data to be displayed.
Since the first line are used for header text of the DataGridView additional logic should be used to ensure the array length matches the count of DataGridView columns in the DataGridView. Header array is sent back to the caller (the form) using a custom delegate implemented through LineDataArgs class which is subscribed too after creating an instance of the class FileOperations.
FileOperations class
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
namespace ForumQuestionTabbed.Classes
{
public class FileOperations
{
public delegate void LineHandler(object sender,LineDataArgs myArgs);
public event LineHandler OnLineHandler;
public List<string[]> ReadLines(string fileName)
{
var linesList = new List<string[]>();
if (!File.Exists(fileName))
{
return linesList;
}
string line;
var isFirstLine = true;
using (var reader = new StreamReader(fileName))
{
while (!reader.EndOfStream)
{
line = reader.ReadLine();
if (isFirstLine)
{
isFirstLine = false;
if (!string.IsNullOrWhiteSpace(line))
{
var lineDataArgs = new LineDataArgs(
Array.ConvertAll(
line.Split('\t'), p => p.Trim()));
OnLineHandler(this, lineDataArgs);
}
continue;
}
if (string.IsNullOrWhiteSpace(line)) continue;
var parts = Array.ConvertAll(line.Split('\t'), p => p.Trim());
if (parts.Length <= 3)
{
linesList.Add(parts);
}
else if (parts.Length > 3)
{
linesList.Add(parts.Take(3).ToArray());
}
}
}
return linesList;
}
}
}
Calling form
Creates an instance of FileOperations in form Shown event, using a for-each to iterate lines returned from the file into the DataGridView and on the first line set header text in the following event.
using System;
using System.IO;
using System.Windows.Forms;
using ForumQuestionTabbed.Classes;
namespace ForumQuestionTabbed
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
dataGridView1.AllowUserToAddRows = false;
Shown += Form1_Shown;
}
private void Form1_Shown(object sender, EventArgs e)
{
var fileName = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Import1.txt");
var fileOperations = new FileOperations();
fileOperations.OnLineHandler += FileOperations_OnReadingFirstLine;
var data = fileOperations.ReadLines(fileName);
foreach (var rowData in data)
{
dataGridView1.Rows.Add(rowData);
}
ActiveControl = dataGridView1;
dataGridView1.CurrentCell = dataGridView1.Rows[0].Cells[0];
dataGridView1.Rows[0].Selected = true;
}
/// <summary>
/// Set column header text to the array elements containing the
/// first line read from the file.
/// </summary>
/// <param name="myObject"></param>
/// <param name="args"></param>
private void FileOperations_OnReadingFirstLine(object myObject, LineDataArgs args)
{
for (int index = 0; index < args.LineArray.Length; index++)
{
dataGridView1.Columns[index].HeaderText = args.LineArray[index];
}
}
}
}
As with prior examples this one may or may not need additional assertions but the rule of thumb is to assert when unsure, wrap code in try-catch where the catch handles the issue or in some cases (rare) does nothing.
Different containers
In all the examples presented a List<Customer> has been used, if a DataTable is required then replace the code to use a list with code to use a DataTable or simply convert the list to a DataTable using the following method.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
namespace Example1Library.HelperClasses
{
public class ListHelpers
{
public static DataTable ConvertToDataTable<T>(IList<T> data)
{
var properties = TypeDescriptor.GetProperties(typeof(T));
var dataTable = new DataTable();
foreach (PropertyDescriptor prop in properties)
{
dataTable.Columns.Add(prop.Name,
Nullable.GetUnderlyingType(prop.PropertyType)
?? prop.PropertyType);
}
foreach (var item in data)
{
var row = dataTable.NewRow();
foreach (PropertyDescriptor prop in properties)
{
row[prop.Name] = prop.GetValue(item) ?? DBNull.Value;
}
dataTable.Rows.Add(row);
}
return dataTable;
}
}
}
Example usage, ReadCustomerPerfectWorld returns a list<Customer>.
/// <summary>
/// Example where there are zero issues with reading in a delimited file
/// into a strong typed list then finally into a DataTable.
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void PerfectWorldExampleDataTableButton_Click(object sender, EventArgs e)
{
var fileOperations = new FileOperations();
var fileName = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Import1.txt");
_bindingSource.DataSource = ListHelpers.ConvertToDataTable(
fileOperations.ReadCustomersPerfectWorld(fileName));
dataGridView1.DataSource = _bindingSource;
}
Visual Studio solution
In the source code class projects are used to do the file operations while a form calls these methods which allows the class project code to be used in other projects. In the Windows Form project there are several text files used to demonstrate good and bad file operations.
Summary
This article has expanded on part 1 of the series for working with delimited files with preprocessing and extension methods.
See also
C# Processing CSC files part 1