Writing Large Excel Files with the Open XML SDK
In my last post, I showed you guys how to use the Open XML SDK to read and parse large Excel files. Today, I am going to show you how to write large Excel files with the Open XML SDK. Like in the case of reading, the Open XML SDK provides two approaches for writing Open XML files: 1. using the DOM, 2. using the SAX-like writing capability. All of my examples on this blog have been around using the Open XML SDK DOM to write Open XML files. The DOM makes it easy because everything is strongly typed. Unfortunately, the DOM approach requires loading entire Open XML parts in memory, which can result in Out of Memory exceptions, especially when dealing with very large files. Let's compare the DOM and SAX-like approaches.
If you want to jump straight into the code, feel free to download the solution here.
Scenario
Imagine a scenario where I work for a financial company that deals with very large amounts of data. This data gets updated on a daily basis and is stored in a database. The company's analysts use Excel as a means of mining and understanding the data. To support the analysts' needs, the company needs to create and update Excel spreadsheets every morning. These spreadsheets typically grow to sizes exceeding hundreds of megabytes.
Solution
The scenario described above uses many of the general concepts described in a previous post on writing data from a database into an Excel file. Today, I will show you how to accomplish the scenario using both the DOM and SAX-like approaches. To accomplish the scenario we will need to take the following actions:
- Create an empty Spreadsheet template using Excel
- Open the template using the Open XML SDK
- Connect to the database that contains all the relevant data (for the sake of simplicity, this blog post will instead write random values to cells)
- Get the main workbook part
- Get the appropriate worksheet
- Get the sheetData element, which will contain the data for a given spreadsheet
- For the number of rows and columns that represent the dimensions of data to be inserted, add a new cell with appropriate data
- Append rows to the sheetData element
- Save and close
Step 1 – Setup the right template
As is the case for most of my blog posts, the easiest way to start any Open XML SDK solution is to create the right starting template. In this case, I am going to create a blank Excel spreadsheet with one sheet in Excel. The template will look like the following:
Open XML SDK DOM Approach
The SDK DOM approach is pretty straight forward. The method below will write random values to cells within a given set of dimensions specified by the number of rows and columns. The random values will be calculated by Excel using the RAND() formula, which is specified by the CellFormula class.
static void WriteRandomValuesDOM(string filename, int numRows, int numCols) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { WorkbookPart workbookPart = myDoc.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First(); SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First(); for (int row = 0; row < numRows; row++) { Row r = new Row(); for (int col = 0; col < numCols; col++) { Cell c = new Cell(); CellFormula f = new CellFormula(); f.CalculateCell = true; f.Text = "RAND()"; c.Append(f); CellValue v = new CellValue(); c.Append(v); r.Append(c); } sheetData.Append(r); } } } |
The method above will definitely work for writing cell values within a worksheet. However, if the dimension specified is quite large then the program's memory footprint will also be quite large. In fact, you are left at the mercy of the garbage collector, which may result in the program throwing an Out of Memory exception.
Open XML SDK SAX-Like Approach
The OpenXmlReader and OpenXmlWriter can be used together to alleviate some of the performance and/or memory concerns when developing an Open XML solution. You can create an OpenXmlReader and an OpenXmlWriter object on an entire part without loading the part in memory. The OpenXmlReader class allows you to iterate the entire xml tree of a part one node at a time. Similarly, the OpenXmlWriter class allows you to write xml nodes within a part one node at a time.
In the case of this solution, we will need to read the original worksheet part and then update the worksheet with the appropriate data. We are going to use an OpenXmlReader object to read the original worksheet part and find the place in the file where we will add data. We are then going to use an OpenXmlWriter object to write values to the worksheet part. You may be tempted to create an OpenXmlReader object and an OpenXmlWriter object on the same part. I would recommend against using this strategy. Writing data to a part that you are reading using the OpenXmlReader and the OpenXmlWriter may result in the OpenXmlReader losing its place in the xml tree. Instead, you should create a temporary worksheet part where you can write values to using the OpenXmlWriter class. The following code creates an OpenXmlReader object and an OpenXmlWriter object for worksheet parts:
static void WriteRandomValuesSAX(string filename, int numRows, int numCols) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { WorkbookPart workbookPart = myDoc.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First(); string origninalSheetId = workbookPart.GetIdOfPart(worksheetPart); WorksheetPart replacementPart = workbookPart.AddNewPart<WorksheetPart>(); string replacementPartId = workbookPart.GetIdOfPart(replacementPart); OpenXmlReader reader = OpenXmlReader.Create(worksheetPart); OpenXmlWriter writer = OpenXmlWriter.Create(replacementPart); ... } } |
Once we have the appropriate objects created, the next step is to write everything we read into the new worksheet part until we reach the SheetData object. Once we reach the SheetData object, we can start writing cell values within a given set of dimensions specified by the number of rows and columns. The following code accomplishes this task:
static void WriteRandomValuesSAX(string filename, int numRows, int numCols) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { ... Row r = new Row(); Cell c = new Cell(); CellFormula f = new CellFormula(); f.CalculateCell = true; f.Text = "RAND()"; c.Append(f); CellValue v = new CellValue(); c.Append(v); while (reader.Read()) { if (reader.ElementType == typeof(SheetData)) { if (reader.IsEndElement) continue; writer.WriteStartElement(new SheetData()); for (int row = 0; row < numRows; row++) { writer.WriteStartElement(r); for (int col = 0; col < numCols; col++) { writer.WriteElement(c); } writer.WriteEndElement(); } writer.WriteEndElement(); } else { if (reader.IsStartElement) { writer.WriteStartElement(reader); } else if (reader.IsEndElement) { writer.WriteEndElement(); } } } ... } } |
Notice that the OpenXmlWriter class requires us to write both the start and end elements. Invoking the method WriteEndElement() off of the OpenXmlWriter class ensures that the appropriate start element is closed.
We're almost done. The last step is to close the reader and writer and replace the original worksheet part with the replacement part. Remember that the main workbook part contains the list of worksheets within the workbook, which means we need to update the xml contained in that part. The following code accomplishes these tasks:
static void WriteRandomValuesSAX(string filename, int numRows, int numCols) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { ... reader.Close(); writer.Close(); Sheet sheet = workbookPart.Workbook.Descendants<Sheet>() .Where(s => s.Id.Value.Equals(origninalSheetId)).First(); sheet.Id.Value = replacementPartId; workbookPart.DeletePart(worksheetPart); } } |
Conclusion
This post gives you more insight into reading and writing large Open XML files, while avoiding performance or Out of Memory exceptions. Let me know if you guys have other Open XML SDK related post requests.
Zeyad Rajabi
Comments
Anonymous
June 24, 2010
Hi Brian, I have been watching these blogs for a while am so disappointed, that there is nothing happening to support Java - Word interoperability! I don't think in my life time I can write an API to manipulate the vast xml file(though if I want I can very well write, but seriously don't have any interest, if word is going to release its final version, let me write then) that gets generated in word! And I won't really would go and look into SharePoint and use it as it doesn't fits to my requirement at all. So, what can be done? Do people need to go for OpenOffice and eventually force the clients and employees to adopt to it and finally go for MacOS or Solaris or some other OS?... and be happy and see more objects... deal with them and live happily :-)? Thanks in advance for whatever kind of reply that is going to come. And it's disappointing that for years Word did not bother to answer to Java folks!Anonymous
June 24, 2010
The comment has been removedAnonymous
June 24, 2010
The comment has been removedAnonymous
June 25, 2010
Really annoyed by the information from the blogger over here ooxmlisdefectivebydesign.blogspot.com/.../shaving-off-standard-xml-for.html :-)! Just found out about Custom XML which gave some hope and landed on to the blog above :-)! Hope Custom XML is still supported by word! Please let me know it's there are not?Anonymous
July 07, 2010
Hi Brian! First of all: Thank You very much for your posts -> they are outstanding and help me a lot in my work. I realise that you are probably overworked but maybe you will find a minute to answer my question: imagine that you've got a spreadsheet which has got xml mapping, is there any simple way to programaticaly retrieve data from mapped cells? I mean export data back to xml? Unfortunately I can't do this using Excel automation because it is going to be a ASP.NET server job (Microsoft seriously warns not to use office automation on server side). Thanks for help!Anonymous
July 08, 2010
Hi, Brian & Zeyad, i like the way you explain all the things. rally helpful for beginners like me.