Reading Data from SpreadsheetML

In my last post, I showed you how to create a spreadsheet based on data from a database using the Open XML SDK. Today I am going to show you how to do the reverse, specifically, how to read data from a spreadsheet and insert it into some data source, like a database.

Again, if you have any specific scenarios or solutions you would like me to address in future posts please let me know.

Scenario

Today's scenario theme is document interrogation, which is all about reading data from a file.

Imagine a scenario where I'm a developer working for a fictional company called Contoso. At my company, the sales team uses Excel to create sales orders for customers. These sales orders are all based on an Excel template that has designated regions that keep track of data such as customer id, invoice number, items being purchased, total dues, etc. The sales team typically creates hundreds of sales orders a day. My company has asked me to create a solution that is able to bulk export data from these sales orders into a database. The company wants this solution to be run every night on the server, so automating Excel is not an option.

Solution

If you want to skip ahead and just download the code + Excel template just click here.

Step 1 – Create a Template File

The scenario I listed above talks about reading sales orders that are based off of a workbook template. In this sales order template I will have designated regions where I expect people to fill in information. To simplify my coding and to ensure that my code is more robust, I am going to take advantage of defined name regions as specified in SpreadsheetML. Defined name regions allow me to "tag" one or more cells within my spreadsheet as having semantic meaning. For example, I can specify a defined name region for a cell that will have a value for customer id. The advantage of using defined names is that I won't have to rely on looking up hardcoded columns and rows. For example, if a cell that contains my customer id changes from being B10 to B11 because a user added a row to the spreadsheet, I am still able to find that specific cell due to the defined name tag.

In any case, I have created a simple template that looks like the following:

 

Step 2 – Building a Defined Names Table

Before I can use the SDK, I need to add the appropriate namespaces to our C# file. In this case, I need to add my Packaging and Spreadsheet API components as follows:

using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Spreadsheet; using DocumentFormat.OpenXml;

Since my sales order file contains defined name regions, the first thing I will need to do is find all the defined names specified in my file. In SpreadsheetML, workbook-level defined names are stored in the main workbook part as follows:

<definedNames>

<definedName name="CustomerCompanyName">'Sales Order'!$B$13</definedName>

<definedName name="CustomerID">'Sales Order'!$E$9</definedName>

<definedName name="CustomerName">'Sales Order'!$B$12</definedName>

<definedName name="ItemsRange">'Sales Order'!$A$22:$G$36</definedName>

</definedNames>

The defined name is stored as a string that I will need to parse. What I am going to do in my solution is read each of these defined names and then parse it out into the following components:

  • Defined name string (I will use this string as a reference to my data)
  • Sheet name
  • Start column
  • Start row
  • End column (only present if defined name is a range)
  • End row (only present if defined name is a range)

I am going to then use this information to then find the value or sets of values that make up the defined name region. Note that my code assumes that no one changes the defined names or adds more defined names with the string "ItemsRange". I can accomplish this task with the following code:

using (SpreadsheetDocument myWorkbook = SpreadsheetDocument.Open(filename, true)) { //Access the main Workbook part, which contains all references WorkbookPart workbookPart = myWorkbook.WorkbookPart; //get all the reference names List<DefinedNameVal> namesTable = BuildDefinedNamesTable(workbookPart); ... }

Where I define the BuildDefineNamesTable method as follows (note that I created a class to store all my information called DefineNamesVal):

static List<DefinedNameVal> BuildDefinedNamesTable(WorkbookPart workbookPart) { //Build a list List<DefinedNameVal> definedNames = new List<DefinedNameVal>(); foreach (DefinedName name in workbookPart.Workbook.GetFirstChild<DefinedNames>()) { //Parse defined name string... string key = name.Name; string reference = name.InnerText; string sheetName = reference.Split('!')[0]; sheetName = sheetName.Trim('\''); //Assumption: None of my defined names are relative defined names (i.e. A1) string range = reference.Split('!')[1]; string[] rangeArray = range.Split('$'); string startCol = rangeArray[1]; string startRow = rangeArray[2].TrimEnd(':'); string endCol = null; string endRow = null; if (rangeArray.Length > 3) { endCol = rangeArray[3]; endRow = rangeArray[4]; } definedNames.Add(new DefinedNameVal() { Key = key, SheetName = sheetName, StartColumn = startCol, StartRow = startRow, EndColumn = endCol, EndRow = endRow }); } return definedNames; }

Note that my code above assumes that none of my defined names are relative defined names. In other words, my defined name ranges contain "$" between the column and row. Okay, at this point I have all the necessary information for me to look up the values for my defined name regions.

Step 3 – Access Worksheet Part Specified by Defined Name Regions

Based on my Excel template, I have two types of data:

  1. Defined names that specify values contained within one cell, for example, customer id
  2. Defined names that specify a region of values, which in this case has the inventory information for the sales order. For example, what is being ordered, how many, how much, etc.

In either case, the value for my defined name is contained in the worksheet specified by the defined name, which I already found. Here is a little method I wrote that returns back the worksheet part based on the defined name data:

private static WorksheetPart GetWorkSheetPart(WorkbookPart workbookPart, DefinedNameVal definedName) { //get worksheet based on defined name string relId = workbookPart.Workbook.Descendants<Sheet>() .Where(s => definedName.SheetName.Equals(s.Name)) .First() .Id; return (WorksheetPart)workbookPart.GetPartById(relId); }

Step 4 – Get Cell Values Specified by Defined Name Regions

Once I have the worksheet part and the cell reference (or region of cells) as specified in the defined name region I can get the cell values. I have two types of values stored in this sales order: numbers and strings. In SpreadsheetML, cell values that are strings can be represented as inline strings within the cell or stored within the shared string table that can be found in the sharedstrings part.

The first step in retrieving a specific cell value based on column and row index is to first find that cell. The second step is then to look up the value for that cell. The following methods will retrieve the cell value based on the cell reference:

static string GetCellValue(WorksheetPart worksheetPart, SharedStringTablePart stringTablePart, string startCol, string startRow) { string reference = startCol + startRow; //get exact cell based on reference Cell cell = worksheetPart.Worksheet.Descendants<Cell>() .Where(c => reference.Equals(c.CellReference)) .First(); return GetValue(cell, stringTablePart); }

If the cell is based on a shared string then you need to look up the appropriate value from the shared string table, otherwise you only have to look at the cell value. The following code accomplishes this task:

static string GetValue(Cell cell, SharedStringTablePart stringTablePart) { if (cell.ChildElements.Count == 0) return null; //get cell value string value = cell.CellValue.InnerText; //Look up real value from shared string table if ((cell.DataType != null) && (cell.DataType == CellValues.SharedString)) value = stringTablePart.SharedStringTable .ChildElements[Int32.Parse(value)] .InnerText; return value; }

Step 5 – Putting it All Together

Now that I have all the basics in place I can put everything together. I need to go through all the defined names and find the appropriate values from the spreadsheet. For the sake of simplicity, instead of writing all found values into a database I am going to instead print out the data to my console. Yes, I got a bit lazy for creating a database to store this data.

As mentioned before, I have two types of data based on my defined names: single cell values and range values based on the sales order. I am going to handle my first type of data, the single cell values, with the following code:

//Create inventory list DefinedNameVal inventory = null; //Go through all the defined names I found foreach (DefinedNameVal definedName in namesTable) { string cellValue = null; //Get the worksheet as specified in my defined name WorksheetPart worksheetPart = GetWorkSheetPart(workbookPart, definedName); //Not inventory list... if (!definedName.Key.Contains("Items")) { //Get the cell value cellValue = GetCellValue(worksheetPart, stringTablePart, definedName.StartColumn, definedName.StartRow); //If cell value is date then convert value to a human readable date if (definedName.Key.Contains("Date")) { DateTime d = new DateTime(1900, 1, 1); d = d.AddDays(Int32.Parse(cellValue) - 1); cellValue = d.ToShortDateString(); } Console.WriteLine(definedName.Key + ":\t " + cellValue); } else { // Remember this for later inventory = definedName; } }

Notice in the code above that I am converting a SpreadsheetML stored date as a human readable date. Since I am just going to print out my found values I am going to keep track of my inventory values as a separate list. I will retrieve my inventory values with the following code:

//Get the worksheet as specified in my inventory WorksheetPart worksheetPart = GetWorkSheetPart(workbookPart, inventory); //Get all data from my inventory int startRow = Int32.Parse(inventory.StartRow); int endRow = Int32.Parse(inventory.EndRow); for (int i = startRow; i <= endRow; i++) { string[] rowValues = new string[7]; //for printing... bool empty = true; //Get a row from the sales order Row row = worksheetPart.Worksheet.Descendants<Row>() .Where(r => i == r.RowIndex) .First(); int j = 0; //Get all the values from my row foreach (Cell c in row.ChildElements) { //Get cell value string value = GetValue(c, stringTablePart); if ((value != null) && (value != "")) empty = false; rowValues[j] = value; j++; } if (empty) break; Console.WriteLine(inventory.Key + i + ": " + string.Join(", ", rowValues)); }

In the code above, I am reading my entire inventory sales order until I reach an empty row, which means the list is complete.

End Result

Imagine I take a sales order with the following information:

Running my code I will end up with the following output (remember I decided to print my information out rather than store the values in a database:

Next Time

In my next few posts I am going to walk through other solutions to some key scenarios. Suggestions are always welcome.

Zeyad Rajabi

Comments

  • Anonymous
    November 17, 2008
    I'd like to get in touch with Zeyad and run some scenarios by him for future 2.0 examples.  What is the best method of contacting him?  His PDC presentation was excellent and we're really looking forward to using 2.0.  Thanks. don.capuano@thomsonreuters.com

  • Anonymous
    November 18, 2008
    In my last two posts, I showed you how to create a spreadsheet based on data and how to read data from

  • Anonymous
    November 18, 2008
    Hi Don, Thanks for the feedback. I will contact you by email. Zeyad Rajabi

  • Anonymous
    November 18, 2008
    Dernier post de cette série sur la suppresion des commentaires dans les documents PowerPoint 2007 (PresentationML).

  • Anonymous
    November 21, 2008
    Comme à l'accoutumé, voici une brochettes de liens de la semaine sur Open XML. Posts techniques en vrac

  • Anonymous
    November 25, 2008
    Over on Brian Jones' blog there are two recent articles that may be of interest to our readers here.