Udostępnij za pośrednictwem


Parsing and Reading Large Excel Files with the Open XML SDK

I want to thank you guys for leaving blog post requests. Today's post will focus on one of your requests: parsing and reading large Excel files with the Open XML SDK. The Open XML SDK provides two approaches for parsing Open XML files: 1. using the SDK DOM, 2. using the SAX-like reading/writing capabilities. The SDK DOM makes it easy to query and parse Open XML files due to strongly typed classes. Unfortunately, the DOM approach requires loading entire Open XML parts in memory, which can sometimes cause Out of Memory exceptions when dealing with really large files. The SAX-like reading/writing capability of the SDK provides a faster and less memory intensive approach. Let's compare the two approaches.

If you want to jump straight into the code, feel free to download the solution here.

Scenario

Imagine a scenario where I work for a financial company that deals with very large Excel spreadsheets. These spreadsheets are updated daily by analysts and can easily grow to sizes exceeding hundreds of megabytes. At the end of the day a solution is used to read and extract relevant data from every spreadsheet.

Here is a screenshot of an example workbook with millions of cells of data:

image

Solution

The scenario described above is pretty simple and involves many of the general concepts described in a previous post on reading data from Excel files. The main difference in today's post is that I will show you the solution using both the DOM and SAX-like approaches. To accomplish the scenario we will need to take the following actions:

  1. Open the workbook using the Open XML SDK
  2. Get the main workbook part
  3. Get the appropriate worksheet
  4. Read every cell value in the sheet

Open XML SDK DOM Approach

The SDK DOM approach is pretty straight forward. The method below will read every cell value for each row contained within the sheet data.

Note that code below works for cells that contain numeric values. Cells that contain other types of content will need a different type of solution.

static void ReadExcelFileDOM(string filename) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { WorkbookPart workbookPart = myDoc.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First(); SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First(); foreach (Row r in sheetData.Elements<Row>()) { foreach (Cell c in r.Elements<Cell>()) { string text = c.CellValue.Text; } } } }

The method above will definitely work for reading cell values within a worksheet. However, if the worksheet is quite large then the program's memory footprint will also be quite large. In fact, you are left at the mercy of the garbage collector, which may result in the program throwing an Out of Memory exception.

Open XML SDK SAX-Like Approach

If you have performance and/or memory concerns when developing an Open XML solution then you should look into using the OpenXmlReader and OpenXmlWriter. In today's post I will cover how to use the OpenXmlReader. I'll try to cover the OpenXmlWriter in a future post.

The method below will read every cell value for each row contained within the sheet data using the SAX-like approach.

Note that code below works for cells that contain numeric values. Cells that contain other types of content will need a different type of solution.

static void ReadExcelFileSAX(string filename) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { WorkbookPart workbookPart = myDoc.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();   OpenXmlReader reader = OpenXmlReader.Create(worksheetPart); string text; while (reader.Read()) { if (reader.ElementType == typeof(CellValue)) { text = reader.GetText(); } } } }

The method above takes advantage of the OpenXmlReader class, which allows you to read an entire part without loading the part in memory. The Create method has the following signatures:

  • public static OpenXmlReader Create(OpenXmlElement openXmlElement);
  • public static OpenXmlReader Create(OpenXmlPart openXmlPart);
  • public static OpenXmlReader Create(Stream partStream);
  • public static OpenXmlReader Create(OpenXmlElement openXmlElement, bool readMiscNodes);
  • public static OpenXmlReader Create(OpenXmlPart openXmlPart, bool readMiscNodes);
  • public static OpenXmlReader Create(Stream partStream, bool readMiscNodes);

As you can see, the Create method allows you to read either a part or an xml subtree given a specific element. There are methods that allow you to skip miscellaneous nodes that are not associated with the Open XML format. Using the OpenXmlReader class allows you to iterate the entire xml tree one node at a time.

Conclusion

Hopefully, this post gives you more insight into reading large Open XML files, while avoiding performance or Out of Memory Exceptions. If you guys have other Open XML SDK related post requests, let me know.

Thanks,

Zeyad Rajabi

Comments

  • Anonymous
    May 31, 2010
    the Open XML SDK help our dev a lot. It is more convenience than original solution

  • Anonymous
    June 01, 2010
    With the SAX / OpenXmlReader approach, is it possible to:-

  1. Detect in advance what the worksheet's UsedRange is?
  2. Read in on a row-by-row basis? Can you give examples of both, please. Thanks.
  • Anonymous
    June 03, 2010
    @Andrew – Here are the answers to your questions:
  1. Look for Dimension element to see what the used range is for a given worksheet. This element should be one of the first elements encountered in the worksheet part
  2. You can use the reader to find rows by using the construct if (reader.ElementType == typeof(Row)). Once you find the row you can do whatever you need.
  • Anonymous
    June 07, 2010
    Hi, I have a question regarding the office 2010 XML format. Every other time I save the same copy of Word document using Word 2010, the structure of the XML for MERGEFIELDS is like this:  
  • <w:fldSimple w:instr="MERGEFIELD TBL_education_specialityname">
  • <w:r w:rsidR="006D13B3">
  • <w:rPr>  <w:noProof />  </w:rPr>  <w:t>«TBL_education_specialityname»</w:t>  </w:r>  </w:fldSimple> and the other times it's  like this:
  • <w:r>  <w:instrText xml:space="preserve">MERGEFIELD TBL_education_specialityname</w:instrText>  </w:r>
  • <w:r> Do you know why that is? Thanks in advance.
  • Anonymous
    June 08, 2010
    This is Excel format in XML format links  method olerun() .one be flagged on ID(006D13B3) , then  linking to sigma functions to newfile. <w:r> tag on Wow server and scrips of <r> ;<t> TBL_education_specialityname</t>

  • Anonymous
    June 08, 2010
    Ovda8ao, This has nothing to do with excel I just took out a small part from the word document and pasted it here. Both of these xml's are representing the same part of the document(which I created) but it changes between these 2 every time I save the document.

  • Anonymous
    June 10, 2010
    The comment has been removed

  • Anonymous
    June 10, 2010
    @Zeyad - Thanks for your reply. I didn't know where to post this problem. This just started occurring when I switched to Office 2010(prior to that I was using office 2007 and had no problems) and this is the case on every computer that is in my company. I developed ,using the openxml sdk 2.0, which heavily depends on the fldSimple but as I say every time I save a document it switches between these structures of xml, which is not acceptable at all. So my question is, is this the case with every 2010 word?   Thanks.

  • Anonymous
    June 11, 2010
    I am really interested in an example on writing large Excel files. Is there already an example out there somewhere that you know of? Looking forward to the OpenXmlWriter post!

  • Anonymous
    June 15, 2010
    We spent a little while looking at how to achieve this without the use of Word automation/Interop before deciding to examine how quickly I could achieve this using the OpenXML SDK and streaming the contents of the XML into an SPFile object.

  • Anonymous
    June 21, 2010
    There is no First metod workbookPart.WorksheetParts.First();

  • Anonymous
    June 21, 2010
    @Bexx - There are occasions where Word will change the way a given field is written. It seems that we made some changes to Word 2010 such that changes to the way fields are written is more common. @David - My next post will be about writing large Excel files with the OpenXmlWriter class. @Abdo - Make sure you include a reference to LINQ.

  • Anonymous
    June 22, 2010
    @Zeyad - Is there any chance that I could talk further with you on this matte, instead of posting here? My email is bjorgolfur@tolvumidlun.is plz send me an e-mail if that is possible. Thank you. Best Regards, Bexx