LINQ to DataSet - LinqDataView and indexes

Index usage

One of the features of the DataView that often gets overlooked is its ability to very quickly find a DataRow (or multiple DataRows) by an index value. The index is built on the sort criteria, so if you sort based on the last name column, the index will also be on the last name. In the 2.0 version of the DataSet, this was restricted to a column value. Now, with LinqDataView, this can be the result of a functional evaluation, that is, anything at all, which really opens up new possibilities.

Why an Index?

The nature of an index means that we can use it to find specific rows very quickly, based on the key of the index, or the value on which the sort occurs. We can see how this works with LinqDataView by performing a number of searches. Let’s say that we retrieve and cache a DataSet that contains our customer list, and we need to find the CustomerID from the Account number. The quick and dirty way to do this would be to have a method like the following.

private int FindCustomerID(string accountNum)

{

    var query = from cust in adventureWorksDS1.Customer

                where cust.AccountNumber == accountNum

                select cust;

    if (query.Count() > 0)

    {

  return query.First().CustomerID;

    }

    else

    {

        throw new Exception("Account Number not found!");

    }

}

This method works fine, and it might be ok if your load isn’t too high, and the results arrive quickly enough. However, you are repeatedly doing the same search so you can speed things up quite a bit by using an index. I am going to refer to the DataView as an index, but be aware that it’s not a special DataView, it’s just a regular old DataView (or LinqDataView).

static DataView index;

private int FindCustomerID(string accountNum)

{

    if (index == null)

    {

        var query = from cust in adventureWorksDS1.Customer

                    orderby cust.AccountNumber

                    select cust;

        index = query.AsDataView();

    }

    if (index.FindRows(accountNum).Count() > 0)

    {

        return index.FindRows(accountNum).First().Row.Field<int>("CustomerID");

    }

    else

    {

        throw new Exception("Account Number not found!");

    }

}

While this code is a lot longer, it will perform quite a bit better. Without an index, finding a particular row value will cost O(N) (as we might have to go through every DataRow) With the index, the cost to find the DataRow with the matching keys is O(log N). This is a huge gain, especially if your DataSet is fairly large.

How high is too high?

When should you use an index, as opposed to a much simpler query? Like most important questions in life – it depends. You will need to measure, and see if your code will speed up. For this simple example, the gains would likely be very small, and would likely not benefit from an index. However, if your sort criteria was something complex (long running calculations, etc), then the benefit shows up very quickly.

If you’d like to see more on this topic, please let me know. Stress and performance are complex beasts, and the usage of DataSet can have a big impact on your overall application.

Thanks,
Erick

Comments

  • Anonymous
    May 19, 2008
    Incidentally, one thing that immediately stuck out at me is this: if (query.Count() > 0)    {        return query.First().CustomerID;    }    else    {        throw new Exception("Account Number not found!");    } In general it is a terrible waste to call Count() instead of Any(), since it will have to enumerate the entire collection.   Anyway, in the end, I think just using Single() would be the better choice because it is doing the whole business of enforcing that there is one-and-only one result for you.

  • Anonymous
    May 20, 2008
    Thanks for the feedback. Any() would definitely work as would Single() (with a catch block and a rethrow for the correct exception in the case of no results). It's a good idea to get out of the habit of using Count. For this situation, it works fine, but in general, it can cause the entire sequence to be walked to get the count. Not good. :(

  • Anonymous
    May 21, 2008
    It is tough to get out of the habit - "you must unlearn what you have learned" : ) On the plus side, now with LINQ you can write what you mean more often and the code becomes more self-descriptive.