Strong binding for CSV reader

 

I updated my open source CSV reader to support parsing rows back into strongly typed objects. You can get it from NuGet as CsvTools 1.0.6.

For example, suppose we have a CSV file “test.csv” like so:

name, species, favorite fruit, score
Kermit, Frog, apples, 18%
Ms. Piggy, Pig, pears, 22%
Fozzy, Bear, bananas, 19.4%

You can open the CSV and read the rows with loose typing (as strings):

 var dt = DataTable.New.Read(@"c:\temp\test.csv");
IEnumerable<string> rows = from row in dt.Rows select row["Favorite Fruit"];

But it’s often more convenient to work with strongly-typed objects. We can define a strongly-typed class for the CSV:

 enum Fruit
{
    apples,
    pears,
    bananas,
}
class Entry
{
    public string Name { get; set; }
    public string Species { get; set; }
    public Fruit FavoriteFruit { get; set; }
    public double Score { get; set; }
}

We can then read the rows as the strongly-typed class:

 var dt = DataTable.New.Read(@"c:\temp\test.csv"); 
Entry[] entries = dt.RowsAs<Entry>().ToArray(); // read all entries

We can also use LINQ query expressions, like:

IEnumerable<Fruit> x = from row in dt.RowsAs<Entry>() select row.FavoriteFruit;

 

What are the parsing rules?

Parsing can get arbitrarily complex. This reader uses some simple rules that covered the scenarios I had.

The parser looks at each property on the strong type and matches it to a column from the CSV. Since property names are restricted to C# identifiers, whereas column names can contain arbitrary characters (and thus be invalid C# identifiers), the matching is flexible: it matches properties to columns by looking only at the alphanumeric characters. So the “FavoriteFruit” property matches the “Favorite Fruit” column name.
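
The real matching lives inside CsvTools, but conceptually it boils down to something like the sketch below (the ColumnMatcher, Normalize, and FindColumn names are mine, not the library’s):

using System.Linq;

static class ColumnMatcher
{
    // Compare names using only their alphanumeric characters, ignoring case,
    // so that "FavoriteFruit" and "Favorite Fruit" come out equal.
    static string Normalize(string name)
    {
        var chars = name.Where(char.IsLetterOrDigit).ToArray();
        return new string(chars).ToLowerInvariant();
    }

    // Returns the index of the column matching the property, or -1 if there is none.
    public static int FindColumn(string[] columnNames, string propertyName)
    {
        for (int i = 0; i < columnNames.Length; i++)
        {
            if (Normalize(columnNames[i]) == Normalize(propertyName))
                return i;
        }
        return -1;
    }
}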

To actually parse a row value from a string into the target type T, it uses the following rules (a rough code sketch follows the list):

  1. If T is already a string, just return the value.
  2. Special-case double parsing to allow a percentage sign (parse “50%” as 0.50).
  3. If T has a TryParse(string, out T) method, invoke that. I found TryParse to be significantly faster than invoking a TypeConverter.
  4. Else use a TypeConverter. This is a general and extensible hook.
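
Here’s a rough sketch of those rules in one place. The ValueParser shape is mine, not CsvTools’ internal API, and the real library compiles this decision once per column rather than re-checking it on every value:

using System;
using System.ComponentModel;
using System.Globalization;
using System.Reflection;

static class ValueParser
{
    public static object Parse(string value, Type targetType)
    {
        // Rule 1: strings pass through untouched.
        if (targetType == typeof(string))
            return value;

        // Rule 2: doubles may carry a trailing '%' ("50%" becomes 0.50).
        if (targetType == typeof(double))
        {
            string s = value.Trim();
            bool isPercent = s.EndsWith("%");
            if (isPercent)
                s = s.Substring(0, s.Length - 1);
            double d = double.Parse(s, CultureInfo.InvariantCulture);
            return isPercent ? d / 100.0 : d;
        }

        // Rule 3: prefer a static TryParse(string, out T) if the type has one.
        MethodInfo tryParse = targetType.GetMethod("TryParse",
            new[] { typeof(string), targetType.MakeByRefType() });
        if (tryParse != null)
        {
            object[] args = { value, null };
            bool ok = (bool)tryParse.Invoke(null, args);
            return ok ? args[1] : null; // null stands in for "ignore the bad value"
        }

        // Rule 4: fall back to a TypeConverter (this is what handles the Fruit enum).
        TypeConverter converter = TypeDescriptor.GetConverter(targetType);
        return converter.ConvertFrom(value);
    }
}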

Errors are ignored. The rationale here is that if I have 3 million rows of CSV data, I don’t want to throw an exception on reading just because one row has bad data.

Under the hood

DataTable.RowsAs<T>() uses expression trees to build a strongly typed dynamic method, a Func<Row, T>. I originally used reflection to enumerate the properties, find the appropriate parsing technique, and set the value on the strong type. Switching to pre-compiled methods was about a 10x perf win.

In this case, the generated method looks something like this:

 class EnumParser
{
    // Property-to-column mapping is resolved once, up front:
    const int columnIndex_Name = 0;
    const int columnIndex_Species = 1;
    const int columnIndex_FavoriteFruit = 2;
    const int columnIndex_Score = 3;

    // The TypeConverter lookup is also done once and reused for every row.
    TypeConverter _convertFavoriteFruit = TypeDescriptor.GetConverter(typeof(Fruit));

    public Entry Parse(Row r)
    {
        Entry newObj = new Entry();
        newObj.FavoriteFruit = (Fruit) _convertFavoriteFruit.ConvertFrom(r.Values[columnIndex_FavoriteFruit]);
        newObj.Name = r.Values[columnIndex_Name];
        newObj.Species = r.Values[columnIndex_Species];
        // ToDouble is the special-case double parser (it understands the trailing '%').
        newObj.Score = ToDouble(r.Values[columnIndex_Score]);
        return newObj;
    }
}

The parse method is a Func<Row, Entry> that can be invoked on each row. It’s actually a closure, so it can capture the TypeConverters and do the lookup only once. The mapping between property names and column names can also be done up front and captured in the columnIndex_* constants.
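
For reference, building that kind of Func<Row, T> with expression trees looks roughly like the sketch below. This is not the library’s exact codegen: the Row stand-in and the Build helper are illustrative, and only the two string properties are wired up (the others would go through the parsing rules above):

using System;
using System.Linq.Expressions;

// Minimal stand-in for the library's Row type, just for this sketch.
class Row
{
    public string[] Values;
}

static class ParserBuilder
{
    // Assemble row => new Entry { Name = row.Values[nameColumn], Species = row.Values[speciesColumn] }
    // as an expression tree, then Compile() it into a real delegate.
    public static Func<Row, Entry> Build(int nameColumn, int speciesColumn)
    {
        ParameterExpression row = Expression.Parameter(typeof(Row), "row");
        Expression values = Expression.Field(row, "Values");

        Expression nameValue = Expression.ArrayIndex(values, Expression.Constant(nameColumn));
        Expression speciesValue = Expression.ArrayIndex(values, Expression.Constant(speciesColumn));

        Expression body = Expression.MemberInit(
            Expression.New(typeof(Entry)),
            Expression.Bind(typeof(Entry).GetProperty("Name"), nameValue),
            Expression.Bind(typeof(Entry).GetProperty("Species"), speciesValue));

        return Expression.Lambda<Func<Row, Entry>>(body, row).Compile();
    }
}

Calling Build once gives back a compiled delegate that can then be applied to millions of rows without any further reflection.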

Comments

  • Anonymous
    May 19, 2012
    For those of us dealing with real data (files 20+ gigabyte in size) ;) What about a version NOT Parsing the whole file into memory ;) Sorry... just spending this weekend doing data exports of a database with around 180 billion data points.

  • Anonymous
    May 19, 2012
    @Thomas - you're in luck. DataTable is a polymorphic base class.  DataTable.New.Read() reads it all into memory and gives you a mutable table back. DataTable.New.ReadLazy() will give you a streaming read over the CSV file. I use that with a file with ~4 million rows and it works nicely.
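
A minimal sketch of that streaming path, assuming the lazy table exposes the same RowsAs<T>() binding as the in-memory one (the file path is just illustrative):

var dt = DataTable.New.ReadLazy(@"c:\temp\huge.csv"); // rows are pulled from disk as they're enumerated
int highScorers = 0;
foreach (var entry in dt.RowsAs<Entry>())
{
    if (entry.Score > 0.20)   // Score was parsed from "22%" into 0.22, etc.
        highScorers++;
}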