다음을 통해 공유


LINQ to TEXT and LINQ to CSV

LINQ is a great tool for writing ad-hoc queries and transforms, and occasionally I need to write queries or transforms on text files.  And sometimes I receive CSV files, and need to do something with them.  I wrote a blog post on LINQ to Text files over two years ago.  My coding practices today differ from what I presented in that blog post.  This post presents my current approach for dealing with text files using LINQ, and includes a function for splitting CSV lines.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCIn that post, I detailed an approach for writing lazy queries on text files.  Well, to tell you the truth, I never write lazy queries on text files – I simply use File.ReadAllLines and write queries over the returned string array.  After all, my computer has a lot of RAM (and probably yours does too), and the CSV files I receive are maybe in the 50K size, and are rarely greater than a few hundred K.  It simply doesn’t matter that I read the entire text file into memory.  If I were to write the exact same two queries today, here is how I would write them:

string[] lines = File.ReadAllLines("TextFile.txt");
var t1 = lines
.Where(l => !l.StartsWith("#"))
.Select(l => l.Split(','))
.Select(items => String.Format("{0}{1}{2}",
items[1].PadRight(16),
items[2].PadRight(16),
items[3].PadRight(16)));
var t2 = t1
.Select(l => l.ToUpper());
foreach (var t in t2)
Console.WriteLine(t);

I have a small extension method (CsvSplit) that I use to split lines that are in CSV format.  I’ve had this method around for a while – it’s not written in the functional style.  Instead, it’s a state machine.  I’ve thought about what it would take to rewrite this method in the functional style, and as far as I know, the only way to do it would be to define a grammar, and maybe write a recursive descent parser.  (Actually, there is another approach, but it would be very inefficient, and the code would be longer and less readable.)  Well the job of parsing CSV files simply isn’t worth the effort!  So a state machine it is.  In any case, the ‘functional impurity’ is local to the function.

The semantics of CsvSplit are:

  • CsvSplit is an extension method on the String class.
  • Only a comma is valid for the separator character.
  • Values can be quoted.  The quotes are trimmed.
  • Quoted values can have, of course, internal commas.  Quoted values can also have internal escape sequences: backslash followed by any character, including quote (\”), backslash (\\) or any other character (\a).
  • CsvSplit will throw an exception for incorrectly formatted strings.

If the CSV file that I receive isn’t in this format, then I just load it into Excel and save so that it is in this format.

If you have this for a source file:

Bob,"Bob said to go to the store."
Mary,"Mary said, \"Whatever.\""
Jim,Jim's quote doesn't contain quotes or commas.

Then you can query the CSV file like this:

var data = File.ReadAllLines("TextFile.txt")
.Select(
l => {
var split = l.CsvSplit();
return new {
Person = split[0],
Quote = split[1]
};
}
);

foreach (var item in data)
Console.WriteLine("{0}:{1}", item.Person, item.Quote);

The function is composable.  If you want to convert the CSV file to an XML file, you can do so like this:

XElement xmlDoc = new XElement("Root",
File.ReadAllLines("TextFile.txt")
.Select
(
line => {
var split = line.CsvSplit();
return new XElement("Quote",
new XElement("Person", split[0]),
new XElement("Text", split[1])
);
}
)
);
Console.WriteLine(xmlDoc);

Here is the listing for CsvSplit (also attached):

publicclassCsvParseException : Exception
{
public CsvParseException(string message) : base(message) { }
}

publicstaticclassMyExtensions
{
privateenumState
{
AtBeginningOfToken,
InNonQuotedToken,
InQuotedToken,
ExpectingComma,
InEscapedCharacter
};

publicstaticstring[] CsvSplit(thisString source)
{
List<string> splitString = newList<string>();
List<int> slashesToRemove = null;
State state = State.AtBeginningOfToken;
char[] sourceCharArray = source.ToCharArray();
int tokenStart = 0;
int len = sourceCharArray.Length;
for (int i = 0; i < len; ++i)
{
switch (state)
{
caseState.AtBeginningOfToken:
if (sourceCharArray[i] == '"')
{
state = State.InQuotedToken;
slashesToRemove = newList<int>();
continue;
}
if (sourceCharArray[i] == ',')
{
splitString.Add("");
tokenStart = i + 1;
continue;
}
state = State.InNonQuotedToken;
continue;
caseState.InNonQuotedToken:
if (sourceCharArray[i] == ',')
{
splitString.Add(
source.Substring(tokenStart, i - tokenStart));
state = State.AtBeginningOfToken;
tokenStart = i + 1;
}
continue;
caseState.InQuotedToken:
if (sourceCharArray[i] == '"')
{
state = State.ExpectingComma;
continue;
}
if (sourceCharArray[i] == '\\')
{
state = State.InEscapedCharacter;
slashesToRemove.Add(i - tokenStart);
continue;
}
continue;
caseState.ExpectingComma:
if (sourceCharArray[i] != ',')
thrownewCsvParseException("Expecting comma");
string stringWithSlashes =
source.Substring(tokenStart, i - tokenStart);
foreach (int item in slashesToRemove.Reverse<int>())
stringWithSlashes =
stringWithSlashes.Remove(item, 1);
splitString.Add(
stringWithSlashes.Substring(1,
stringWithSlashes.Length - 2));
state = State.AtBeginningOfToken;
tokenStart = i + 1;
continue;
caseState.InEscapedCharacter:
state = State.InQuotedToken;
continue;
}
}
switch (state)
{
caseState.AtBeginningOfToken:
splitString.Add("");
return splitString.ToArray();
caseState.InNonQuotedToken:
splitString.Add(
source.Substring(tokenStart,
source.Length - tokenStart));
return splitString.ToArray();
caseState.InQuotedToken:
thrownewCsvParseException("Expecting ending quote");
caseState.ExpectingComma:
string stringWithSlashes =
source.Substring(tokenStart, source.Length - tokenStart);
foreach (int item in slashesToRemove.Reverse<int>())
stringWithSlashes = stringWithSlashes.Remove(item, 1);
splitString.Add(
stringWithSlashes.Substring(1,
stringWithSlashes.Length - 2));
return splitString.ToArray();
caseState.InEscapedCharacter:
thrownewCsvParseException("Expecting escaped character");
}
thrownewCsvParseException("Unexpected error");
}
}

LinqToCsv.cs

Comments

  • Anonymous
    September 30, 2008
    If I'm reading that code correctly, quoted strings in your CSV files cannot contain backslashes unless they are escaping quotes? I find it very interesting that you decided against a functional approach, because when I wrote a CSV parsing method back in the nineties using VB4, I decided that the easiest way to do it would be a recursive function that parsed the first value, then passed the rest of the string to itself. It was not efficient, but it was easy to write. That was back before all the functional programming stuff appeared on my radar screen.

  • Anonymous
    October 01, 2008
    Hi Jeffrey, You are absolutely right, quoted strings could not contain backslashes unless escaping quotes.  I've updated the code so that backslashes can escape any character, including backslash. Regarding writing this using the functional approach, your idea is interesting.  However, given that I only take CSV files under duress, and then only for a quick and dirty solution to data transformation (generally, I immediately convert to an XML document), this is probably something I'm not going to do in the near future.  :-) But this raises the point - til now, I think that pattern matching of certain types is one of the areas where it is harder to use LINQ.  I'm still thinking about this area - what would a regex replacement look like using LINQ?  What are the appropriate LINQ patterns to use when you have items in a collection that depend on their context, which in turn depends on an enclosing context?  Grammars and recursive descent parsers are the cannonical way to do this - to write lexical parsers, for instance.  But what about just looking for patterns in a string? I've been thinking about approaches to specify patterns to match, and then having a method that yields up matches, but the implementation of the method that yields matches is an area that I haven't finalized.  I have ideas about how to do this with queries, but haven't fully developed them.  This is going to require more research.  (I'm sure that some of the FP experts out there have solved this problem; I just don't know the solution.) -Eric

  • Anonymous
    October 12, 2008
    Other sample in codeproject web site

  • Anonymous
    March 25, 2009
    The comment has been removed

  • Anonymous
    May 14, 2009
    Great post, It helped out immensely.  One question, would it be possible to make the return new {    Person = split[0],    Quote = split[1] }; dynamic in nature, ie use a configuration file to control the property and the location in the string to get the value from. thanks, jim

  • Anonymous
    May 14, 2009
    Hi Jim, it certainly would be possible - with .NET 3.5, one way to do this is using a Dictionary<T> per row.  With .NET 4.0, there will be dynamic types that will enable this in a very clean way. -Eric

  • Anonymous
    May 15, 2009
    thanks for the tip, we are using 3.5 now and i am anxiously awaiting 4.0 for quite a few things now :)

  • Anonymous
    June 26, 2009
    Hi Eric! Thanks for this post it's quite useful. There is one thing that I encountered when I used this for the first time, there is no whitespace (or at least space) support in the code. Because I'm writing much code, I did my sample file like this: Alias,Display Name,Email alias1, "test, me", asd@asd.com For the 2nd row you'll parse it into 4 parts: alias1 test me asd@asd.com which is "wrong". In my opinion, (white)spaces are allowed around a quoted string. For regular data I'm not sure, but if someone wants to use spaces in the data itself, then enclose the string in quotes like: "   spaces around me   ". So for the regular data spaces are allowed too, but they must be trimmed, and quoted strings should be trimmed only outside of the quotes. It was quite easy to add support spaces as I described above. I added this as the last check in the AtBeginningOfToken state:                        if (sourceCharArray[i] == ' ')                        {                            continue;                        } and this as the FIRST check in the ExpectingComma state:                        if (sourceCharArray[i] == ' ')                        {                            continue;                        } The following enhancements could make it more generic:

  • Add support for parameterized quote identifier(s)
  • Add support for parameterized separator
  • Add support for number parsing, based on number format (I know it's not definitely a split function, but can be :-)) Thanks, Attila
  • Anonymous
    June 17, 2010
    CSV split code is awesome, thanks dude! :)

  • Anonymous
    July 07, 2011
    How about an example of LINQ XML to CSV? Thanks