The eXcel Files: Using the OLEDB provider to manipulate Excel/Access/Project files

In a previous post , I outlined the reasons why using the Office Object Model on the server is a bad idea. Some readers pointed out that third parties sell components which can understand Word/Excel/Access/Project files and convert them to XML.

I personally do not like the idea of processing an Office document within BizTalk. Basically, if you send Office documents as messages to BizTalk, you are essentially duplicating the data (whoever sent out the document still has a copy of it and can alter it) and this can be a source of inconsistencies. The less copies of the same data, the easier it is. Think about what happens when you send a Word document to a few reviewers: someone has to merge changes into one document at some point.

Instead of sending an Office document, I prefer to send a reference (an URL to the document stored in some repository for instance). However, I understand that under some circumstances, the document is "swallowed" by the process: the document comes as an Excel file, the process converts it to XML and discard the Excel file. In that case, it might be fine since there is not really a copy of the Office document. In this post, I will try to explain how one can read (and possibly write, with some caveats) Excel/Access/Project files.

It turns out that the OLEDB provider for JET can read/write Excel files. This capability has been available for quite some time and Knowledge Base article 306572 describes how this task can be achieved from ASP.NET. Since this is an OLEDB provider, it does not use the Office Object Model and therefore does not have its limitations. However, as we shall see soon, the Jet provider for Excel has limitations as well. Here is a non exhaustive list:

  1. the Excel workbook needs to be "prepared" for the Jet provider to work fine: it is not like you can take any random Excel file and start accessing it,
  2. you specify which file you would like the provider to act on. That's right, a file. in BizTalk, everything is a stream, not a file so you will have to create a temporary file out of the stream.

The Jet provider for Excel is documented, supported for usage on the server and comes at no extra cost: it may very well do the job you need to be done.

As you have probably figured, opening an Excel file with the OLEDB provider is going to make your Excel spreadsheet look like a database. This means that you first need to define tables in Excel. Then you can access the data in these tables. So let's see how this all works.

Preparing the Excel file

Before you can read data from the Excel file, you need to select a range of cells that will be seen as a table. This process is called "Naming a range of cells". You can name a set of cells by following these steps (see screen shot below):

  1. Ensure that all columns you plan to see as a table have headers. In my example below, the headers are "LineItem", "Price" and "Number". This is required since the provider will assume the first row gives names to columns,
  2. Select C4:E6 and in the Insert Menu, select Name->Define ...,
  3. Ensure that all columns you plan to see as a table have headers. In my example below, the headers are "LineItem", "Price" and "Number". This is required since the provider will assume the first row gives names to columns,
  4. Inside the "Names in workbook" field, type the name of a table, for instance "Orders" and click "Add" then "OK".

You can repeat steps 1 to 4 as many times as you want with different cell ranges if you wish to expose more than one table.

Writing the custom disassembler

To write the custom disassembler, it helps if we first decide how we will convert this table to XML. For this example, I'll just convert each row to the following XML:

<NamedTables>
<Table name = "<name>">
<Row <field1_name>="<field1value>" <field2_name>="<field2value>" ... />
[...]
</Table>
[...]
</NamedTables>

So for the previous example, this should output the following XML:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<NamedTables>
<Table name="Orders">
<Row LineItem="Handset" Price="48.7" Number="2" />
<Row LineItem="Battery" Price="125.8" Number="1" />
</Table>
</NamedTables>

As most of you know, a custom disassembling component should implement IDisassemblerComponent. I will concentrate on how one could implement this interface. We first need to save the stream we receive to a temporary file. This is achieved by the following private method:

/// <summary>
/// Save the given stream as a file in the temporary directory.
/// </summary>
/// <param name="inStream">Stream to save.</param>
/// <returns>Full path to the file.</returns>
private string SaveStreamAsFile(Stream inStream)
{
// Put files into the temporary directory
string tempPath = Path.Combine(Path.GetTempPath(), Path.GetTempFileName());

// Send the whole stream to the temporary file
FileStream outFileStream = new FileStream(tempPath, FileMode.OpenOrCreate, FileAccess.Write, FileShare.None);

try
{
int bytesRead;
byte[] buffer = new byte[bufferSize];

// Save the whole file
while ((bytesRead = inStream.Read(buffer, 0, buffer.Length)) != 0)
outFileStream.Write(buffer, 0, bytesRead);
}
catch(Exception)
{
// Close the stream and get rid of the temporary file
outFileStream.Close();
try { File.Delete(tempPath); } catch (Exception) { /* Ignore failure deleting the file */ }

// Make sure we do not return any path
outFileStream = null;
tempPath = null;
}
finally
{
// Close the stream no matter what happens
if (outFileStream != null) { outFileStream.Close(); }
}

return tempPath;
}

I should not be catching all exceptions here with a catch(Exception)block. This is a practice that goes against the "catch only exceptions you know about". If you intend to use this code in a production environment, you should not catch all exceptions here.

The Disassemble method just remembers the message stream and clones the message context for further usage:

public void Disassemble(IPipelineContext pContext, IBaseMessage pInMsg)
{
// Clone the message context so we can associate it with the new message later
originalMsgContext = PipelineUtil.CloneMessageContext(pInMsg.Context);

// Remember the input message instance
inputMsg = pInMsg;
}

Now comes the GetNext method. This is where most of the work happens. Basically, we connect to the provider with the following connection string (%{0}% has been replaced with the full path to the temporary .xls file):

private const string connectionString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source= %{0}% ;Extended Properties=Excel 8.0;";

We first save the stream to a temporary location and we produce the desired XML using the ExcelNameToXml method. Note that a real component would have a much more elaborate error detection and reporting than mine below. My error handling is minimal at best.

/// <summary>
/// Get the result of the disassembling process.
/// </summary>
/// <param name="pContext">IPipelineContext: context information.</param>
/// <returns>IBaseMessage: a message or null if no more messages to be produced.</returns>
public IBaseMessage GetNext(IPipelineContext pContext)
{
IBaseMessage newMsg = null;

// We handle messages only if there is a part
if ((inputMsg != null) && (inputMsg.PartCount >= 1))
{
// Currently, we process only the body
if (inputMsg.BodyPart != null)
{
// Save the input message as an xsl file so we can use the provider
string fullPath = SaveStreamAsFile(inputMsg.BodyPart.GetOriginalDataStream());
OleDbConnection oleDbConnection = null;
MemoryStream outputStream = null;
XmlTextWriter xmlWriter = null;

try
{
// Connect to the storage (Excel)
oleDbConnection = new OleDbConnection(connectionString.Replace("%{0}%", fullPath));
oleDbConnection.Open();

// Create a new message
newMsg = pContext.GetMessageFactory().CreateMessage();

           // Get a new stream so we can prepare the XML and connect it to an XMLTextWriter
outputStream = new MemoryStream();
xmlWriter = new XmlTextWriter(outputStream, Encoding.UTF8);

// Prepare the message: We combine all tables into one XML document
// like the following:
// <NamedTables>
// <Table name = "<name>">
           // <Row <field1_name>="<field1value>" <field2_name>="<field2value>" ... />
// [...]
// </Table>
// [...]
// </NamedTables>

xmlWriter.Formatting = Formatting.Indented;

xmlWriter.WriteStartDocument(true);
xmlWriter.WriteStartElement("", "NamedTables", "");

// Access the table we need
ExcelNameToXml("Orders", oleDbConnection, xmlWriter);
}
catch(Exception)
{
// Get rid of the input stream
if (outputStream != null) { outputStream.Close(); outputStream = null; }

// We do not need a reference to the cached objects
inputMsg = null;
originalMsgContext = null;

throw;
}
finally
{
// Close the document
if (xmlWriter != null)
{
xmlWriter.WriteEndElement();
xmlWriter.WriteEndDocument();
xmlWriter.Flush();
}

// Close and delete the connection
if (oleDbConnection != null)
{
// Close the connection if needed
if (oleDbConnection.State != ConnectionState.Closed)
oleDbConnection.Close();

oleDbConnection.Dispose();
}

// Don't let the file in the temporary directory
try { File.Delete(fullPath); } catch (Exception) { /* Ignore failure deleting */ }
}
// Prepare to return the message
if (newMsg != null)
{
// Rewind the stream so it is ready for use by the messaging engine
outputStream.Seek(0, SeekOrigin.Begin);

// Add one part to it. This part will be the "body"
IBaseMessagePart newMsgPart = pContext.GetMessageFactory().CreateMessagePart();
newMsgPart.Charset = "UTF-8";
newMsgPart.ContentType = "text/xml";
newMsgPart.Data = outputStream;
newMsg.AddPart("body", newMsgPart, true);

// Copy the original message context to the newly created message
for (int iProp = 0; iProp < originalMsgContext.CountProperties; iProp++)
{
string strName;
string strNSpace;
object val = originalMsgContext.ReadAt(iProp, out strName, out strNSpace);

// If the property has been promoted, respect the settings
if (originalMsgContext.IsPromoted(strName, strNSpace))
newMsg.Context.Promote(strName, strNSpace, val);
else
newMsg.Context.Write(strName, strNSpace, val);
}

// There is a new stream so let's make it visible to the resourceTracker
pContext.ResourceTracker.AddResource(outputStream);

// There are no more messages to process - We do not need a reference to the context
originalMsgContext = null;
inputMsg = null;
}
}
}

return newMsg;
}

The method is easily implemented by using an OleDbCommand object and getting an OleDbDataReader after command execution as shown below. Note that again, the error handling should be vastly improved if this code was to be run in production.

/// <summary>
/// Executes a query agains the OLEDB provider and returns a reader.
/// </summary>
/// <param name="query">Query to execute.</param>
/// <param name="oledbConn">OleDbConnection to use.</param>
/// <returns>OleDbDataReader after execution of the command.</returns>
private OleDbDataReader RunQuery(string query, OleDbConnection oledbConn)
{
// Create a command and get a reader to retrieve data
OleDbCommand oleDbCmd = new OleDbCommand(query, oledbConn);
OleDbDataReader oleDbReader = oleDbCmd.ExecuteReader(CommandBehavior.SequentialAccess);

return oleDbReader;
}

/// <summary>
/// Converts an Excel table to an XML island.
/// </summary>
/// <param name="tableName">Name of the Excel table.</param>
/// <param name="oledbConn">OleDbConnection to use.</param>
/// <param name="xmlWriter">XmlTextWriter to use when producing the output XML.</param>
private void ExcelNameToXml(string tableName, OleDbConnection oledbConn, XmlTextWriter xmlWriter)
{
OleDbDataReader oleDbReader = RunQuery(selectCommand.Replace("%{0}%", tableName), oledbConn);

try
{
// We produce the following XML island
// <Table name = "<name>">
// <Row <field1_name>="<field1value>" <field2_name>="<field2value>" ... />
// [...]
// </Table>

// Table node
xmlWriter.WriteStartElement("", "Table", "");
xmlWriter.WriteAttributeString("name", tableName);
// No point writing data if there are no row
if (oleDbReader.HasRows)
{
// Prepare the list of fields
ArrayList fields = new ArrayList(oleDbReader.FieldCount);
for (int iField = 0; iField < oleDbReader.FieldCount; iField++)
fields.Add((object) oleDbReader.GetName(iField));

// Emit all rows
while (oleDbReader.Read() == true)
{
xmlWriter.WriteStartElement("", "Row", "");

for(int iPos = 0; iPos < fields.Count; iPos++)
xmlWriter.WriteAttributeString((string) fields[iPos], oleDbReader.GetValue(iPos).ToString());

xmlWriter.WriteEndElement();
}
}

xmlWriter.WriteEndElement();
}
finally
{
if (oleDbReader != null) { oleDbReader.Close(); }
}
}

Where is the source for this?

The complete source code can be downloaded here . Remember that this code is offered "AS IS", without any warranty of any kind. Also, as explained before, the error handling in this example is weak at best. It should be improved if you plan to run this in production.

How about Access or Project?

This would work too. You just need to adjust the connection string. This page list many possible providers for various documents/databases.

Comments