C#: Processing CSV files (Part 1)
Introduction
This series will explore various aspects for importing a CSV file with comma-separated-values (.csv) into a SQL-Server database. CSV files are a common way to share data in plain text format from sources such as a database table(s) to another database e.g. from SQL-Server to an Oracle database.
The accompanying source code and code blocks have been kept very simply so that following along and learning the basics is not overwhelming as this generally can happen the deeper into the import process you the develop go.
When data exported from a database to a client database that has a matching database table(s) with matching columns the process is not always simple, for example business rules may indicate new incoming data can’t overwrite existing data or incoming data needs to be merged with existing data.
In the wild rarely is a simple import possible as database data types all have the same basic types but are handled differently from database to database. Couple this with a flat CSV file may need to be split up into multiple database tables.
Considerations for processing data
The following should always be considered when importing CSV files.
- All columns are suspect to be missing altogether or missing in one or more rows.
- Mixed data types, consider a column with dates where some rows may have malformed dates, dates setup for a different culture, columns that should be numeric were some rows have no value or unexpected format etc.
- Columns which have values that are not valid to your business e.g. a list of products that need to map to a product table where there are products that you don’t handle.
- Column values out of range e.g. a numeric column has a range of 1 through 10 but incoming data has values 1 through 100.
- The file is in use by another process and is locked.
- The file is extremely large and processing time may take hours, have a plan such as to run a nightly job.
- Handling rows/columns that don’t fit into the database, have a plan to handle them as several examples will be shown in this series.
- Offering clients, a method(s) to review suspect data, modify or reject the data.
- Consider an intermediate database table so that processing suspect data can be done over time especially when there is a large data set that may take hours or days to process.
Consider working with CSV files as a puzzle no matter what the structure should be and that parsing different files usually has their own quirks.
Part one goals
To read a simple CSV file just over 7,500 records, nine columns with types ranging from integer, float, date time and strings with malformed data.
To parse data a TextFieldParser will be used to read and parse data. Alternates to a TextFieldParser are reading data using a Stream (StreamReader) or OleDb when sticking with pure Microsoft classes. Outside of this there are several libraries that can handle reading CSV files yet as stated this series is solely for working with Microsoft classes.
During parsing assertion is performed to validate data is the proper types, not empty and if in validate ranges. Data read in is placed into a list of a class designed to handle the data read in from the CSV file.
The TextFieldParser class does a great job at processing incoming data which is why this class was selected. As with any class there can be unknowns which become known once you have worked with them and learn them. With the TextFieldParser when looping though lines in a file, empty lines are skipped. In the code sample nothing is done but the line count will be off by the amount of empty lines encountered to what might be learned from opening the file in Notepad++ or similar text editor. Using OleDb or a Steam lines are not ignored but nothing is truly gained if the record count is correct e.g. there are 150 lines were 50 lines are empty and you expect 100 lines of valid data. This means you have received the correct amount of data, just that there are empty lines to filter out.
Requires
Microsoft Visual Studio 2017 (or higher)
Microsoft SQL-Server Express or the full SQL-Server.
Visual interface
The interface is done using Windows forms project as these types of projects are easy to setup then setting a web project up coupled with a Windows form project need not be installed on a user’s machine but instead may be executed from a shared location.
File selection
In the code samples below a hard-coded file is used, in the wild a file may be selected by a file selection dialog, by reading one or more files from a directory listing. If the process were to be from a directory listing then the results would go directly to a intermediate table for review while in the code samples provided here they are sent directly to a DataGridView.
Parsing data using StreamReader
First check to ensure the file to parse exists. In the following code block mHasException and mLastException are from a base exception class which the class for parsing inherits. The return type is a ValueTuple (installed using NuGet Package Manager).
if (!File.Exists(_inputFileName))
{
mHasException = true;
mLastException = new FileNotFoundException($"Missing {_inputFileName}");
return (mHasException, new List<DataItem>(),new List<DataItemInvalid>() );
}
If the file exists the next step is to setup several variables which will be used for validation purposes and return types which will contain valid and if presented invalid data when read in data from the CSV file.
var validRows = new List<DataItem>();
var invalidRows = new List<DataItemInvalid>();
var validateBad = 0;
int index = 0;
int district = 0;
int grid = 0;
int nCode = 0;
float latitude = 0;
float longitude = 0;
The following code block follows the code block above.
A while statement is used to loop through each line in the CSV file. For each line, split the line by comma in this case which is the most common delimiter. Next validate there are nine elements in the string array. If there are not nine elements in the array then place them into a possible reject container.
Note that the first line contains column names which is skip by checking the index/line number stored in the variable index.
Following the check for nine elements int a line seven elements in the string array are checked to ensure they can be converted to the expected data type ranging from date to numerics and also empty string values.
Passing the type check above the section under the comment Questionable fields will do several more checks e.g. does the NICIC field contain data that is not in an expected range. Note all data should be checked here such as the data in part[3] as this can be subjective to the data in other elements in the array so this is left to the review process which will provides a grid with a dropdown of validate selections to select from. If there are issues to review a record a property is set to flag the data for a manual review process and loaded into a list.
try
{
using (var readFile = new StreamReader(_inputFileName))
{
string line;
string[] parts;
while ((line = readFile.ReadLine()) != null)
{
parts = line.Split(',');
index += 1;
if (parts == null)
{
break;
}
index += 1;
validateBad = 0;
if (parts.Length != 9)
{
invalidRows.Add(new DataItemInvalid() { Row = index, Line = string.Join(",", parts) });
continue;
}
// Skip first row which in this case is a header with column names
if (index <= 1) continue;
/*
* These columns are checked for proper types
*/
var validRow = DateTime.TryParse(parts[0], out var d) &&
float.TryParse(parts[7].Trim(), out latitude) &&
float.TryParse(parts[8].Trim(), out longitude) &&
int.TryParse(parts[2], out district) &&
int.TryParse(parts[4], out grid) &&
!string.IsNullOrWhiteSpace(parts[5]) &&
int.TryParse(parts[6], out nCode);
/*
* Questionable fields
*/
if (string.IsNullOrWhiteSpace(parts[1]))
{
validateBad += 1;
}
if (string.IsNullOrWhiteSpace(parts[3]))
{
validateBad += 1;
}
// NICI code must be 909 or greater
if (nCode < 909)
{
validateBad += 1;
}
if (validRow)
{
validRows.Add(new DataItem()
{
Id = index,
Date = d,
Address = parts[1],
District = district,
Beat = parts[3],
Grid = grid,
Description = parts[5],
NcicCode = nCode,
Latitude = latitude,
Longitude = longitude,
Inspect = validateBad > 0
});
}
else
{
// fields to review in specific rows
invalidRows.Add(new DataItemInvalid() { Row = index, Line = string.Join(",", parts) });
}
}
}
}
catch (Exception ex)
{
mHasException = true;
mLastException = ex;
}
Once the above code has completed the following line of code returns data to the calling form/window which is a ValueTupler.
return (IsSuccessFul, validRows, invalidRows);
Parsing data using TextFieldParser
This example uses a TextFieldParser to process data. Rather then splitting lines manually as done above TextFieldParser.ReadFields method handles the splitting by the delimiter assigned in parser.Delimiters. The remainder for validating data is no different then done with StreamReader. One major difference is empty lines are ignored unlike with SteamReader.
public (bool Success, List<DataItem>, List<DataItemInvalid>, int EmptyLineCount) LoadCsvFileTextFieldParser()
{
mHasException = false;
var validRows = new List<DataItem>();
var invalidRows = new List<DataItemInvalid>();
var validateBad = 0;
int index = 0;
int district = 0;
int grid = 0;
int nCode = 0;
float latitude = 0;
float longitude = 0;
var emptyLineCount = 0;
var line = "";
try
{
/*
* If interested in blank line count
*/
using (var reader = File.OpenText(_inputFileName))
{
while ((line = reader.ReadLine()) != null) // EOF
{
if (string.IsNullOrWhiteSpace(line))
{
emptyLineCount++;
}
}
}
using (var parser = new TextFieldParser(_inputFileName))
{
parser.Delimiters = new[] { "," };
while (true)
{
string[] parts = parser.ReadFields();
if (parts == null)
{
break;
}
index += 1;
validateBad = 0;
if (parts.Length != 9)
{
invalidRows.Add(new DataItemInvalid() { Row = index, Line = string.Join(",", parts) });
continue;
}
// Skip first row which in this case is a header with column names
if (index <= 1) continue;
/*
* These columns are checked for proper types
*/
var validRow = DateTime.TryParse(parts[0], out var d) &&
float.TryParse(parts[7].Trim(), out latitude) &&
float.TryParse(parts[8].Trim(), out longitude) &&
int.TryParse(parts[2], out district) &&
int.TryParse(parts[4], out grid) &&
!string.IsNullOrWhiteSpace(parts[5]) &&
int.TryParse(parts[6], out nCode);
/*
* Questionable fields
*/
if (string.IsNullOrWhiteSpace(parts[1]))
{
validateBad += 1;
}
if (string.IsNullOrWhiteSpace(parts[3]))
{
validateBad += 1;
}
// NICI code must be 909 or greater
if (nCode < 909)
{
validateBad += 1;
}
if (validRow)
{
validRows.Add(new DataItem()
{
Id = index,
Date = d,
Address = parts[1],
District = district,
Beat = parts[3],
Grid = grid,
Description = parts[5],
NcicCode = nCode,
Latitude = latitude,
Longitude = longitude,
Inspect = validateBad > 0
});
}
else
{
// fields to review in specific rows
invalidRows.Add(new DataItemInvalid() { Row = index, Line = string.Join(",", parts) });
}
}
}
}
catch (Exception ex)
{
mHasException = true;
mLastException = ex;
}
return (IsSuccessFul, validRows, invalidRows,emptyLineCount);
}
Parsing data using OleDb
This method "reads" lines from a CSV file with the disadvantage of all fields are not typed and carry more baggage than needed for processing lines from the CSV file which will make a difference in time to process with larger CSV files.
public DataTable LoadCsvFileOleDb()
{
var connString = $@"Provider=Microsoft.Jet.OleDb.4.0;.....";
var dt = new DataTable();
try
{
using (var cn = new OleDbConnection(connString))
{
cn.Open();
var selectStatement = "SELECT * FROM [" + Path.GetFileName(_inputFileName) + "]";
using (var adapter = new OleDbDataAdapter(selectStatement, cn))
{
var ds = new DataSet("Demo");
adapter.Fill(ds);
ds.Tables[0].TableName = Path.GetFileNameWithoutExtension(_inputFileName);
dt = ds.Tables[0];
}
}
}
catch (Exception ex)
{
mHasException = true;
mLastException = ex;
}
return dt;
}
Reviewing
The following window has several buttons at the bottom. The Process button executes reading the CSV file using in this case StreamReader. The dropdown will contain any line number which needs to be inspected, pressing the inspect button moves to that line in the grid, this would be for a small amount of lines with issues or to get a visual on a possible larger problem. The button labeled Review will popup a child window to permit edits that will update the main window below.
Child window shown when pressing the "Review" button.
The only true validation done on this window is to provide a list of valid values for the beat field using a Dropdown from a static list. As this series continues a database reference table will replace the static list.
Code for validating through a Drop-down in the DataGridView.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using WindowsFormsApp1.Classes;
namespace WindowsFormsApp1
{
public partial class ReviewForm : Form
{
private BindingSource _bs = new BindingSource();
private List<DataItem> _data;
/// <summary>
/// Provide access by the calling form to the data presented
/// </summary>
public List<DataItem> Data
{
get { return _data; }
}
/// <summary>
/// Acceptable values for beat field. In part 2 these will be read from a database reference table.
/// </summary>
private List<string> _beatList = new List<string>()
{
"1A", "1B", "1C", "2A", "2B", "2C", "3A", "3B", "3C", "3M", "4A",
"4B", "4C", "5A", "5B", "5C", "6A", "6B", "6C"
} ;
public ReviewForm()
{
InitializeComponent();
}
public ReviewForm(List<DataItem> pData)
{
InitializeComponent();
_data = pData;
Shown += ReviewForm_Shown;
}
private void ReviewForm_Shown(object sender, EventArgs e)
{
dataGridView1.AutoGenerateColumns = false;
// ReSharper disable once PossibleNullReferenceException
((DataGridViewComboBoxColumn) dataGridView1.Columns["beatColumn"]).DataSource = _beatList;
_bs.DataSource = _data;
dataGridView1.DataSource = _bs;
dataGridView1.ExpandColumns();
dataGridView1.EditingControlShowing += DataGridView1_EditingControlShowing;
}
/// <summary>
/// Setup to provide access to changes to the current row, here we are only interested in the beat field.
/// Other fields would use similar logic for providing valid selections.
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void DataGridView1_EditingControlShowing(object sender, DataGridViewEditingControlShowingEventArgs e)
{
if (dataGridView1.CurrentCell.IsComboBoxCell())
{
if (dataGridView1.Columns[dataGridView1.CurrentCell.ColumnIndex].Name == "beatColumn")
{
if (e.Control is ComboBox cb)
{
cb.SelectionChangeCommitted -= _SelectionChangeCommitted;
cb.SelectionChangeCommitted += _SelectionChangeCommitted;
}
}
}
}
/// <summary>
/// Update current row beat field
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void _SelectionChangeCommitted(object sender, EventArgs e)
{
if (_bs.Current !=null)
{
if (!string.IsNullOrWhiteSpace(((DataGridViewComboBoxEditingControl)sender).Text))
{
var currentRow = (DataItem) _bs.Current;
currentRow.Beat = ((DataGridViewComboBoxEditingControl) sender).Text;
currentRow.Inspect = false;
}
}
}
}
}
Extension methods used in the above code blocks.
namespace WindowsFormsApp1.Classes
{
public static class DataGridViewExtensions
{
/// <summary>
/// Expand all columns excluding in this case Orders column
/// </summary>
/// <param name="sender"></param>
public static void ExpandColumns(this DataGridView sender)
{
sender.Columns.Cast<DataGridViewColumn>().ToList()
.ForEach(col => col.AutoSizeMode = DataGridViewAutoSizeColumnMode.AllCells);
}
/// <summary>
/// Used to determine if the current cell type is a ComboBoxCell
/// </summary>
/// <param name="sender"></param>
/// <returns></returns>
public static bool IsComboBoxCell(this DataGridViewCell sender)
{
var result = false;
if (sender.EditType != null)
{
if (sender.EditType == typeof(DataGridViewComboBoxEditingControl))
{
result = true;
}
}
return result;
}
}
}
Data classes to contain data read from the CSV file.
Good/questionable data class
namespace WindowsFormsApp1.Classes
{
public class DataItem
{
public int Id { get; set; }
public DateTime Date { get; set; }
public string Address { get; set; }
public int District { get; set; }
public string Beat { get; set; }
public int Grid { get; set; }
public string Description { get; set; }
public int NcicCode { get; set; }
public float Latitude { get; set; }
public float Longitude { get; set; }
public bool Inspect { get; set; }
public string Line => $"{Id},{Date},{Address},{District},{Beat}," +
$"{Grid},{Description},{NcicCode},{Latitude},{Longitude}";
public override string ToString()
{
return Id.ToString();
}
}
}
Invalid data class.
namespace WindowsFormsApp1.Classes
{
public class DataItemInvalid
{
public int Row { get; set; }
public string Line { get; set; }
public override string ToString()
{
return $"[{Row}] '{Line}'";
}
}
}
Up coming in part 2
- Storing data in a SQL-Server table which will then be presented in a grid rather than showing read data into a grid.
- Storing changes performed in the review process back to the database intermediate table.
- Storing references in SQL-Server rather than static variables.
- Presenting a move to production table in SQL-Server.
- Presenting several options to push data from data classes to SQL-Server.
Summary
In this article thoughts/ideas along with suggestions have been presented to dealing with CSV files which is to be considered a building block which continues in part 2 of this series.
See also
Source code