What’s up with all those “rsids”?

As many folks who worked with the 2003 wordprocessingML format have probably noticed by now, there is a new set of attributes and elements in the Open XML wordprocessingML format that shows up all over the place. I'm talking about RSIDs.

RSIDs are used to allow applications to more effectively merge two documents that have forked. It's best to explain this with an example, so let's imagine I have a document that has the following text (we'll call this document "Brian1"):

Clearly this is a great thing for the industry. I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.

I then send this document out to my coworker Steve to review and make changes. Steve decides that he wants to add in a bit of a sarcastic remark for the first sentence so when he sends back the document it looks like this (we'll call it "Steve1"):

Clearly this is a great thing for the industry (unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway). I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.

While Steve was reviewing his copy of the document, I also made some changes. I removed that second sentence, so now my document looks like this (we'll call it "Brian2"):

Clearly this is a great thing for the industry. We now have an official standard that provides all the details necessary to read and write office documents.

Now, when Steve sends me his copy back, I'd like to have my word processor merge my document and his so that I get the most up-to-date version with both of our edits. Ultimately, the merged document would look like this (we'll call it "Final"):

Clearly this is a great thing for the industry (unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway). I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.

Here the parenthetical remark (blue) is tracked as an insertion, and the sentence "I personally feel like it's really cool." (red) is tracked as a deletion.

Now, why is this example interesting at all? Well, if we only stored the basic text of this document, it would be very difficult to merge. In looking at the difference between "Brian2" and "Steve1", how would the application know what was an insertion and what was a deletion? If I still had my original file ("Brian1"), it would be easy to track this, but that's most likely not the case. I only have my edited document "Brian2", and Steve's document "Steve1". How do you know that the text "I personally feel like it's really cool" wasn't something that Steve added, as opposed to something that I deleted?

One way you can do this is via "track changes" functionality, where the application tracks the insertions and deletions as they happen and stores that in the format, but this often isn't desired. Often, for privacy reasons, people don't want to have the revisions tracked in their documents. Instead, they just want to be able to merge the documents later, and have the application figure out what was inserted, and what was deleted.

Well, the way we deal with this is through revision identifiers (RSIDs). Every time a document is opened and edited, a unique ID is generated, and any edits that are made get labeled with that ID. This doesn't track who made the edits, or what date they were made, but it does allow you to see what was done in a single session. The list of RSIDs is stored with the document (in the document settings), and every piece of text is labeled with the RSID of the session in which that text was entered.
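To make the idea concrete, here is a rough Python sketch of the bookkeeping. It is only a toy model: the `Document` class, `new_rsid`, and the choice of a random 8-hex-digit string are my own assumptions for illustration, not how Word actually generates or stores its IDs.

```python
import secrets


def new_rsid(existing):
    """Pick a random 8-hex-digit ID that is not already used in this document."""
    while True:
        rsid = f"{secrets.randbits(32):08X}"
        if rsid not in existing:
            return rsid


class Document:
    """Toy model: a document-level RSID list plus (text, rsid) runs."""

    def __init__(self):
        self.rsids = []      # every editing session this document has seen
        self.runs = []       # each piece of text with the session that entered it
        self.session = None

    def start_session(self):
        # A fresh ID per editing session; it carries no author or date.
        self.session = new_rsid(self.rsids)
        self.rsids.append(self.session)

    def type_text(self, text):
        self.runs.append((text, self.session))


doc = Document()
doc.start_session()
doc.type_text("Clearly this is a great thing for the industry.")
doc.start_session()
doc.type_text(" I personally feel like it's really cool.")
print(doc.rsids)
print(doc.runs)
```

Each run ends up tagged with the ID of the session that produced it, and the document-level list records every session, including ones that only deleted text.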

This approach is what allows us to properly merge the two documents. When we merge documents, we can see which RSIDs the two documents share. Any shared RSIDs represent text that was entered before the document was forked. Any RSIDs that are unique to one of the documents represent edits that were made after it was forked.

This means that if we see text in one document, but not in the other, all we need to do is look at the RSID applied to that text. If it's one of the shared RSIDs, that means the text existed before the documents were forked. That also means that when we merge the documents, we can assume that the text was deleted from one of the documents, rather than added to the other.
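The rule in the last two paragraphs boils down to a couple of set operations. The sketch below is just my illustration of that rule; the function names and the placeholder IDs are made up, and a real merge has to do a lot more than this.

```python
def split_rsids(rsids_a, rsids_b):
    """Return (shared, only_in_a, only_in_b) for two documents' RSID lists."""
    a, b = set(rsids_a), set(rsids_b)
    return a & b, a - b, b - a


def explain_extra_text(rsid, shared):
    """Explain text that appears in one copy but not in the other."""
    if rsid in shared:
        return "deleted from the other copy"
    return "added to this copy after the fork"


# Placeholder IDs, not taken from any real file.
shared, only_a, only_b = split_rsids(["AAAA0001", "AAAA0002"],
                                     ["AAAA0001", "BBBB0003"])
print(explain_extra_text("AAAA0001", shared))
print(explain_extra_text("BBBB0003", shared))
```

Text carrying a shared ID must predate the fork, so its absence from the other copy can only be a deletion; text carrying an ID the other copy has never seen must be new.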

Let's go back to our example. In the original file, the XML would look something like this:

<w:body>
  <w:p w:rsidRDefault="00544FOB">
    <w:r>
      <w:t>Clearly this is a great thing for the industry. I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.</w:t>
    </w:r>
  </w:p>
</w:body>

This says that all runs (<w:r>) in the paragraph have the RSID "00544FOB" by default. And in the document settings, we would have "00544FOB" listed as one of the RSIDs for the document. (Note that RSIDs show up in a number of other places, but we're only focusing on the text for this case.)
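If you want to poke at this yourself, here is a small Python sketch that lists each run with the RSID that applies to it, falling back to the paragraph's rsidRDefault when the run has no rsidR of its own. The helper name `runs_with_rsid` and the shortened sample text are mine, and I've added the namespace declaration that the snippets in this post leave out so that a standard XML parser will accept the fragment.

```python
import xml.etree.ElementTree as ET

# The wordprocessingML main namespace (the "w:" prefix).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def runs_with_rsid(body_xml):
    """Return (rsid, text) for every run, using the paragraph default if needed."""
    root = ET.fromstring(body_xml)
    result = []
    for p in root.iter(f"{{{W}}}p"):
        default = p.get(f"{{{W}}}rsidRDefault")
        for r in p.iter(f"{{{W}}}r"):
            rsid = r.get(f"{{{W}}}rsidR", default)
            text = "".join(t.text or "" for t in r.iter(f"{{{W}}}t"))
            result.append((rsid, text))
    return result


# Shortened stand-in for the "Brian1" body shown above.
BRIAN1 = f"""<w:body xmlns:w="{W}">
  <w:p w:rsidRDefault="00544FOB">
    <w:r><w:t>Clearly this is a great thing for the industry.</w:t></w:r>
  </w:p>
</w:body>"""

for rsid, text in runs_with_rsid(BRIAN1):
    print(rsid, "|", text)
```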

Now, after the document went to Steve, and he made his edits, the document "Steve1" would look like this:

<w:body>
  <w:p w:rsidRDefault="00544FOB">
    <w:r>
      <w:t>Clearly this is a great thing for the industry</w:t>
    </w:r>
    <w:r w:rsidR="00FF1F58">
      <w:t>(unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway)</w:t>
    </w:r>
    <w:r>
      <w:t>. I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.</w:t>
    </w:r>
  </w:p>
</w:body>

Notice that while the formatting properties on all three runs are the same, the middle run has a different RSID value. This happens because Steve added that additional text, so it was assigned a new RSID value, "00FF1F58". If you look in the document settings for this document, there will be two RSIDs: "00544FOB" and "00FF1F58".
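In the package, that list lives in the settings part as a w:rsids element with one w:rsid child per ID. Here is a hedged Python sketch that pulls the IDs out of such a fragment. The fragment is hand-written to match the "Steve1" example rather than copied from a real file, and `document_rsids` is just a name I picked.

```python
import xml.etree.ElementTree as ET

# The wordprocessingML main namespace (the "w:" prefix).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def document_rsids(settings_xml):
    """Collect the w:val of every w:rsid element in a settings fragment."""
    root = ET.fromstring(settings_xml)
    return {el.get(f"{{{W}}}val") for el in root.iter(f"{{{W}}}rsid")}


# Hand-written fragment for illustration only.
STEVE1_SETTINGS = f"""<w:settings xmlns:w="{W}">
  <w:rsids>
    <w:rsid w:val="00544FOB"/>
    <w:rsid w:val="00FF1F58"/>
  </w:rsids>
</w:settings>"""

print(sorted(document_rsids(STEVE1_SETTINGS)))
```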

Now, separately I opened my copy and deleted some text. So the document "Brian2" is going to look like this:

<w:body>
  <w:p w:rsidRDefault="00544FOB">
    <w:r>
      <w:t>Clearly this is a great thing for the industry. We now have an official standard that provides all the details necessary to read and write office documents.</w:t>
    </w:r>
  </w:p>
</w:body>

Notice that the runs in the paragraph still all have the same RSID. There aren't any new RSIDs in the body because I didn't add any text. I did, however, edit the document, so if you look in the document settings, there will be a new RSID. So in "Brian2", we have the following two RSIDs: "00544FOB" and "00A95BA5".

So, when we go to generate the "Final" document, we merge "Brian2" with "Steve1". As we merge the two documents, we see that they share the RSID "00544FOB", but that all other RSIDs are unique to those copies. This means that any text with the RSID "00544FOB" existed in the original file, and any other text was added after the fork. There are two pieces of text in Steve's document that aren't in mine. The first piece of text, which reads "(unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway)", was an addition made by Steve, rather than something I deleted. That text had an RSID unique to Steve's document. The other text, which reads "I personally feel like it's really cool.", on the other hand has an RSID that is shared between the two documents. That tells us that it was deleted from my copy, rather than added to Steve's.
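Here is the same reasoning as a short Python sketch, using the RSIDs from this example. It assumes the two pieces of text that differ have already been found and paired with the RSID on their runs; `track_change` is a hypothetical helper, not anything Word exposes.

```python
BRIAN2_RSIDS = {"00544FOB", "00A95BA5"}   # from Brian2's document settings
STEVE1_RSIDS = {"00544FOB", "00FF1F58"}   # from Steve1's document settings

# Text found in Steve1 but not in Brian2, with the RSID that applies to its run.
ONLY_IN_STEVE1 = [
    ("00FF1F58", "(unless you happen to be one of those folks who had "
                 "investments in growing this myth that there was some kind "
                 "of \"file format war\" underway)"),
    ("00544FOB", "I personally feel like it's really cool."),
]


def track_change(rsid, mine, theirs):
    """Decide how to mark text that is in their copy but not in mine."""
    if rsid in (mine & theirs):
        return "deletion"    # existed before the fork, so I must have removed it
    return "insertion"       # entered in a session my copy never saw


for rsid, text in ONLY_IN_STEVE1:
    print(track_change(rsid, BRIAN2_RSIDS, STEVE1_RSIDS), "->", text)
```

Steve's parenthetical comes out as an insertion and my removed sentence as a deletion, which matches the "Final" document above.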

So, next time you're looking at a wordprocessingML document and wondering why it's broken out into so many runs, you'll know the answer. This is another example of how the simplicity of the flat schema wordprocessingML uses makes it easy to add properties to the various runs of text. The RSID isn't a container, but rather just a property of the text. If we had the ability to nest runs within other runs (similar to the HTML <span> model), then it would be a bit more complicated (not impossible, just more complicated). The architecture of a wordprocessingML file is much simpler. Since runs can't be nested in other runs, you have a more predictable ancestor list to walk through when finding the properties of a particular run.

If you would rather not have these RSIDs in your files, they're easy enough to turn off. Just go to the trust center and turn off the setting "Store random number to improve combine accuracy".

Two other important things to note. First, the RSID tells us nothing about the time or order in which things were done. RSIDs are completely random, and are only used for seeing where things match, so they aren't of much use unless you are merging with another document that also has RSIDs. Second, they are not just used for content, but for other settings as well, like styles, layout, etc.

-Brian

Comments

  • Anonymous
    December 11, 2006
    There's an interesting post on Brian Jones's blog today about how rsids work in Open XML documents .

  • Anonymous
    December 11, 2006
    Hi Brian, I want to get the XML document that is created by WordML 2007. I need a Java program to combine the extracted XML files into one XML file and one XSL file to render a PDF. Could you please help with this? Thanks in advance. Somanna R

  • Anonymous
    December 11, 2006
    How do you guarantee that the RSIDs are unique? Is Word using a public API to generate the RSIDs?

  • Anonymous
    December 11, 2006
    How will such a merge deal with elements that do not have revision tags? For instance, if someone edits the document using an OOXML editor that does not support revisions, can the merge function also merge on the basis of the added or altered text?

  • Anonymous
    December 12, 2006
    Wonderful comments this one attracts. My comment: Nice.  Sweet.  Did I say nice?   I love it when something so simple works so well.  I hope you guys are proud of yourselves, you earned your pay with this one.

  • Anonymous
    December 12, 2006
    The comment has been removed

  • Anonymous
    December 13, 2006
    It would be removed by the document inspector, but it's not really a privacy issue. You could tell that parts of the document were edited at different times, but you wouldn't know the order. The rsids are unique, and not sequential. If you are building your own tools for editing or generating documents, there are no real rules for how you generate the rsids, just that if you want it to work properly you should make sure you've made them unique. If the rsids are not present at all, then you can still merge documents, it just may not be as accurate. -Brian

  • Anonymous
    December 16, 2006
    If you have already had the chance to work with Word documents in XML format, you have no doubt noticed

  • Anonymous
    December 17, 2006
    "Store random number to improve combine accuracy" instead of "Enable anonymous edit tracking"? I guess it depends on whether your intended audience is developers or users. We still have users who don't know that "Fast Saves" really means "Keep most edit details to enable possibly embarrassing or compromising disclosures later to save a little time saving files."

  • Anonymous
    June 03, 2008
    Today at a client's, more or less 12 seconds after walking through the door, I was asked a

  • Anonymous
    November 03, 2008
    [Blog Map] A convenient way to explore Open XML markup is to create a small document, modify the document