Streaming in SQL Server using SQL CLR

The SQL Server engine execution operators come in two kinds: blocking and non-blocking. Blocking operators need to consume the complete input before returning any output. Think about COUNT(*): in order to tell you the result it has to go over all your rows. Another example of a blocking operator is Sort: it cannot emit a single row until it has read the whole input, because the last row read could be the one that sorts first.

Non-blocking operators, on the contrary, can produce output before having processed all the input rows. Think about this query:

 SELECT a, b, a + b FROM Table

SQL Server can compute the exact result of a + b for each row as soon as it has read the row itself. In this case SQL Server will not wait for the whole table to be read from disk: it will stream the data to the output as soon as it's ready.

As you might have guessed by now, non-blocking operators are generally better: the fewer bottlenecks your execution plan has, the better.

The SQL Server optimizer, of course, knows on its own when and where it's best to use these operators. You should however be mindful of the implications of a blocking function in your own code, especially when dealing with slow-to-retrieve datasets. As a real example, suppose we want to access the OpenData Lombardia portal and perform some statistical calculations. The data is available at this URL: https://www.dati.lombardia.it/api/views/sd8x-w4h3/rows.csv?accessType=DOWNLOAD (rows are available in other formats such as JSON and XML, but for the purpose of this article we will work with CSV).

In order to show a real example I've created a much bigger version of the recordset (exactly 1,305,487,617 bytes) by replicating the original dataset.

Using the CLR we can read the recordset and return it as SQL NVARCHAR rows, one per line. A first, blocking, approach could be:

 [SqlFunction(
    DataAccess = DataAccessKind.None,
    SystemDataAccess = SystemDataAccessKind.None,
    FillRowMethodName = "_StreamLine",
    IsDeterministic = false,
    IsPrecise = true,
    TableDefinition = (@"Line NVARCHAR(MAX)"))]
public static System.Collections.IEnumerable BlockingFileLine(
    SqlString fileName)
{
    // Every line of the file is appended to this list before anything is
    // returned: the whole dataset gets materialized in memory.
    List<string> lStreams = new List<string>();

    using (System.IO.FileStream fs = new System.IO.FileStream(
                    fileName.Value,
                    System.IO.FileMode.Open,
                    System.IO.FileAccess.Read,
                    System.IO.FileShare.Read))
    {

        using (System.IO.StreamReader sr = new System.IO.StreamReader(fs))
        {
            string str;
            while ((str = sr.ReadLine()) != null)
                lStreams.Add(str);
        }
    }

    return lStreams;
}
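
The SqlFunction attribute above references a fill-row method named _StreamLine, which is not shown in this post (the complete implementation is in the project linked in the comments below). SQL Server calls it once per item returned by the enumerable to turn that item into an output row; a minimal sketch could look like this:

 public static void _StreamLine(object obj, out SqlChars Line)
{
    // The enumerable yields plain strings; map each one to the single
    // NVARCHAR(MAX) column declared in TableDefinition.
    Line = new SqlChars(((string)obj).ToCharArray());
}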

This undoubtedly works as expected, generating a line-by-line TVF. However, look at the memory footprint (I'm assuming a freshly started instance for the sake of simplicity):

This is far from optimal. Look at the virtual_memory_committed_kb delta: it's a LOT. SQL Server went from 13,264 KB to 4,132,364 KB: almost 4 GB just for this function (over 3 times the size of the initial dataset)!

In SQL CLR functions we can, and should, use lazy evaluation instead: your main function is expected to return a System.Collections.IEnumerable instance. This is good news: an IEnumerable is not required to know the exact number of items in advance, which means we can produce its items while it is being traversed. With a streaming approach we don't need to read the whole file up front; we can read the relevant bits only when they are requested.
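
As a side note, a C# iterator method (yield return) gives you exactly this kind of lazily populated IEnumerable for free. Here is a minimal sketch of a hypothetical IteratorFileLine function built that way (in this article I'll build the enumerator by hand instead, to show what happens under the hood):

 [SqlFunction(
    DataAccess = DataAccessKind.None,
    SystemDataAccess = SystemDataAccessKind.None,
    FillRowMethodName = "_StreamLine",
    IsDeterministic = false,
    IsPrecise = true,
    TableDefinition = (@"Line NVARCHAR(MAX)"))]
public static System.Collections.IEnumerable IteratorFileLine(
    SqlString fileName)
{
    // The compiler generates the lazy enumerator for us: each line is read
    // only when the consumer asks for the next row, and the reader is
    // disposed when the enumerator is disposed.
    using (System.IO.StreamReader sr = new System.IO.StreamReader(fileName.Value))
    {
        string str;
        while ((str = sr.ReadLine()) != null)
            yield return str;
    }
}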

In the previous example we returned a List (which implements IEnumerable), but the list was fully populated in advance, hence the heavy memory usage. Here we will build the streaming version explicitly, returning a custom class that implements IEnumerable directly:

 [SqlFunction(
    DataAccess = DataAccessKind.None,
    SystemDataAccess = SystemDataAccessKind.None,
    FillRowMethodName = "_StreamLine",
    IsDeterministic = false,
    IsPrecise = true,
    TableDefinition = (@"Line NVARCHAR(MAX)"))]
public static System.Collections.IEnumerable StreamFileLine(
    SqlString fileName)
{
    // Note: no 'using' block here. The FileStream must outlive this method;
    // it is closed when SQL Server disposes the enumerator we return.
    System.IO.FileStream fs = new System.IO.FileStream(
                    fileName.Value,
                    System.IO.FileMode.Open,
                    System.IO.FileAccess.Read,
                    System.IO.FileShare.Read);

    return new LineStreamer(fs);
}

Here our LineStreamer class is just a wrapper around our real implementation:

 public class LineStreamer : Streamer, System.Collections.IEnumerable
{
    public LineStreamer(System.IO.Stream stream)
        : base(stream)
    { }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return new LineStreamerEnumerator(stream);
    }
}
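
The Streamer base class is not listed here (the full source is in the project linked in the comments below); all it has to do is hold the Stream reference for its derived classes, so a minimal version could be as simple as:

 public abstract class Streamer
{
    // The only state we carry around: the stream handed to the enumerator.
    protected System.IO.Stream stream;

    public Streamer(System.IO.Stream stream)
    {
        this.stream = stream;
    }
}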

The real work happens in the LineStreamerEnumerator class. It implements System.Collections.IEnumerator, which allows it to be traversed in a foreach loop. In order to achieve streaming we should produce a row only when it is explicitly requested, and we should discard any non-essential information to minimize the memory footprint.

Here we go:

 public class LineStreamerEnumerator : System.Collections.IEnumerator, IDisposable
{
    protected System.IO.StreamReader sr = null;
    protected string _CurrentLine = null;

    public LineStreamerEnumerator(System.IO.Stream s)
    {
        sr = new System.IO.StreamReader(s);
    }

    object System.Collections.IEnumerator.Current
    {
        get { return _CurrentLine; }
    }

    bool System.Collections.IEnumerator.MoveNext()
    {
        _CurrentLine = sr.ReadLine();
        return _CurrentLine != null;
    }

    void System.Collections.IEnumerator.Reset()
    {
        throw new NotImplementedException();
    }

    #region IDisposable / destructor
    void IDisposable.Dispose()
    {
        if (sr != null)
        {
            sr.Dispose();
            sr = null;
        }
    }

    // Finalizers are not supported in SQL CLR
    //~LineStreamerEnumerator()
    //{
    //    if (sr != null)
    //        sr.Dispose();
    //}
    #endregion
}

Notice that I’ve implemented the IDisposable interface explicitly. SQL Server will call Dispose on our enumerator as soon as it’s done with it, which gives us a chance to clean up any lingering streams. We cannot use finalizers in SQL CLR, as the commented-out destructor above shows.
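
To make the contract explicit, this is roughly what any consumer of our enumerable (SQL Server included) does. It is only an illustration, not the actual engine code, and the file path is made up:

 // Hypothetical consumer: walk the IEnumerable, then dispose the enumerator.
// That Dispose call is the moment our StreamReader (and file handle) is closed.
public static void ConsumeLikeSqlServerWould()
{
    System.Collections.IEnumerable rows =
        StreamFileLine(new SqlString(@"C:\temp\rows.csv"));

    System.Collections.IEnumerator e = rows.GetEnumerator();
    try
    {
        while (e.MoveNext())
        {
            string line = (string)e.Current;
            // SQL Server would hand 'line' to the _StreamLine fill-row method here.
        }
    }
    finally
    {
        IDisposable d = e as IDisposable;
        if (d != null)
            d.Dispose();
    }
}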

Let’s try streaming in action:


Here the memory delta is a LOT smaller: 107,728 - 16,844 = 90,884 KB. We lowered our memory footprint from almost 4 GB to under 100 MB.

Impressive, isn't it?

Of course this is an ad-hoc example, so you should not expect such big gains in the real world. It is, however, good practice to avoid blocking operators whenever possible.

More on the topic:

Remember also that most .NET streams buffer automatically, so you don't have to worry about buffering yourself. This fits nicely with the streaming pattern; System.Net.WebRequest is a good way to see why.
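
For example, a hypothetical StreamUrlLine variant could reuse the very same LineStreamer to read straight from an HTTP response (this is a sketch only; note that reaching the network from SQL CLR requires the assembly to be deployed with at least the EXTERNAL_ACCESS permission set):

 [SqlFunction(
    DataAccess = DataAccessKind.None,
    SystemDataAccess = SystemDataAccessKind.None,
    FillRowMethodName = "_StreamLine",
    IsDeterministic = false,
    IsPrecise = true,
    TableDefinition = (@"Line NVARCHAR(MAX)"))]
public static System.Collections.IEnumerable StreamUrlLine(SqlString url)
{
    // The response body is read from the network on demand by our enumerator,
    // so rows start flowing while the download is still in progress.
    System.Net.WebRequest request = System.Net.WebRequest.Create(url.Value);
    System.Net.WebResponse response = request.GetResponse();
    return new LineStreamer(response.GetResponseStream());
}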

 

Happy Coding,

Francesco Cogno

Comments

  • Anonymous
    April 09, 2014
    where is code for _StreamLine
  • Anonymous
    April 09, 2014
    Can you provide the zip file file containing the Visual Studio Solution.
  • Anonymous
    April 09, 2014
    Hi SQLClrGuy, the _StreamLine method is here: https://sqlservertoazure.codeplex.com/SourceControl/latest#ITPCfSQL.Azure.CLR/Streaming/Stream.cs .
    I suggest you get the whole solution from here: https://sqlservertoazure.codeplex.com/ (it's an open source project); all the relevant classes are in the ITPCfSQL.Azure.Streaming namespace.

    Cheers,
    Francesco
  • Anonymous
    August 25, 2015
    This is great, however the real-life implementation of SQLCLR streaming is very limited if your input data is read from the database.
    I have a project where I want to implement the following common streaming scenario:
    1. Open a dataReader
    2. Read one row
    3. Process data
    4. Stream output
    5. Read next data
    6. Process data
    7. Stream output
    8. Continue until all data is read
    9. Close data reader
    Unfortunately, SQLCLR cannot handle this very basic scenario since it does not allow you to start streaming data out unless the dataReader is closed. Therefore, in order to use streaming I need to load all input data into memory and close my dataReader before I start streaming data out. This washes off any memory gains from using streaming.
    Is there a better solution for this?
  • Anonymous
    November 08, 2015
    Awesome example of enumerating the StreamReader as a list of strings...you rock!!