Interoperable HTML Parsing in IE9

09/13/2010

The HTML parser is an important part of how we deliver on “same markup” because it plays a vital role in how the DOM is constructed and, therefore, in how any DOM API or CSS rule is applied. While we’ve talked a lot about some of the high-profile API improvements in IE9 (getElementsByClassName, addEventListener, and so on), one important improvement we haven’t talked about is the HTML parser.

This is clearly important for developers, so we made interoperability improvements to our HTML parser in IE9 Standards Mode. This blog post provides practical guidance on how these improvements affect your site and how to avoid pitfalls in areas where browsers still don’t all behave the same way.

innerHTML

Originally introduced as IE-proprietary APIs, innerHTML and outerHTML have gained some early traction as standards and are widely implemented by other browsers, but with some differences. These properties are unusual among DOM APIs in that they invoke the parser. In IE9 we made changes to address the most common interoperability issues.

Much of the work we did here was simplifying our behavior internally. Prior to IE9, we took whatever input was passed to innerHTML/outerHTML and treated it as if it were the only content in an otherwise blank page (resulting in an implicit <html>, <head>, <body>, etc.). We then attempted to merge this page back into the calling element, which sometimes resulted in an “Unknown Runtime Error.”

In IE9, we improved the behavior to support more cases while removing all occurrences of “Unknown Runtime Error.” In cases that still don’t work, you’ll get a descriptive DOMException instead.

While the mainstream scenarios work pretty well across browsers, these APIs are still evolving and interop isn’t perfect in every case. For example, the following has different results in different browsers:

var img = document.getElementsByTagName('img')[0];
img.innerHTML = "image text";

The <img> element can’t have children, so the above doesn’t work in Chrome, Safari, or IE8, and has different behavior altogether in FF3.6. In IE9, Opera, and FF4 Beta, cases like this work as expected, and the text node is inserted properly.

In order to avoid problems with innerHTML, it’s a good idea to only feed it markup that can stand on its own. For example, setting div.innerHTML = "<p></p>" is fine, because <div> and <p> can exist without each other.

For small edits, you can also use DOM Core APIs like appendChild.
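
As a rough illustration of the DOM Core route, the sketch below appends a paragraph of plain text without invoking the HTML parser at all. The helper name appendParagraph is made up for this example, and it assumes the target is an ordinary element that exposes ownerDocument and appendChild.

```javascript
// Hypothetical helper (not part of any browser API): appends <p>text</p>
// to `parent` using DOM Core calls only, so no markup is parsed.
function appendParagraph(parent, text) {
  var doc = parent.ownerDocument;
  var p = doc.createElement("p");
  // createTextNode stores the string as literal character data.
  p.appendChild(doc.createTextNode(text));
  parent.appendChild(p);
  return p;
}

// Possible usage in a page:
//   appendParagraph(document.body, "Hello from DOM Core");
```

Because the text goes in through createTextNode, any markup characters in it end up as literal text rather than being parsed.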

Generic Elements

One request from developers is having better support for generic elements. A generic element has the same syntax as any other element, but a tag name that isn’t defined in HTML (for example, <awesome>). IE9 Standards Mode follows the HTML5 spec and treats generic elements much like <span> tags. This means you can add more descriptive tag names to your page and style them as you would any other element:

<awesome style="font-size: large;">IE9</awesome>

This allows you to semantically describe the content of your page without losing any of the power you have with normal elements, using the same code as you would in other browsers.

Whitespace

One change that affects almost every page is how we parse whitespace. While IE8 removes or collapses whitespace, IE9 persists all whitespace into the DOM at parse-time. So the following markup:

<div>
<span>IE  9</span>
</div>

Was represented in the IE8 DOM as:

div
|->span
|--->"IE 9"

And is represented in the IE9 DOM as:

div
|->"\n"
|->span
|--->"IE\t9"
|->"\n"

If your site depends on the existence or non-existence of whitespace, this change has substantial impact. The document structure will contain far more whitespace nodes, so APIs like firstChild might not reference the same node they used to. Another consideration is text node length. Because whitespace is now preserved within text nodes, the character index within a string might be different from what you’re expecting.

IE9’s behavior matches the HTML5 spec and interoperates with other browsers. There are ways this behavior can make your page more fragile, depending on how you use whitespace in your markup. Here are a few suggestions for avoiding these problems:

For scenarios where you just want elements, use the Element Traversal APIs: properties such as firstElementChild ensure you don’t reference a stray newline character by mistake.
For scenarios where you need more than just elements, like text nodes, use explicit type-checking via nodeType or a similar API. Depending on why you’re accessing individual characters in a text node, the split() method on JavaScript’s String object could be quite useful for isolating the parts of a string you want to examine.
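
Both suggestions can be sketched in a few lines. The helper names firstElement and words are invented for this example. The first prefers firstElementChild where the browser provides it and otherwise falls back to a nodeType check. The second uses split() to ignore whitespace inside a text node’s data.

```javascript
var ELEMENT_NODE = 1; // value of Node.ELEMENT_NODE

// Hypothetical helper: returns the first child that is an element,
// skipping whitespace-only text nodes and comments.
function firstElement(parent) {
  if (parent.firstElementChild !== undefined) {
    return parent.firstElementChild; // Element Traversal is available
  }
  for (var n = parent.firstChild; n; n = n.nextSibling) {
    if (n.nodeType === ELEMENT_NODE) {
      return n;
    }
  }
  return null;
}

// Hypothetical helper: splits text on runs of whitespace and drops the
// empty strings that leading or trailing whitespace would produce.
function words(text) {
  return text.split(/\s+/).filter(function (w) {
    return w.length > 0;
  });
}
```

With the markup above, firstElement(div) would skip the leading newline node and return the <span>, and words("IE\t9") would yield the two tokens regardless of which whitespace character separates them.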

Overlapping Tags

As web developers, we don’t like to admit it, but we’ve probably all written the following markup at some point:

<b><i>important text</b></i>

Overlapped tags are a far more common occurrence than you might think, partly because they’re not always as obvious as the example above. Take the markup below:

<p><b><div>text</div></b></p>

The <p> element can’t legally contain a <div>, so IE, Firefox, Chrome, and Safari implicitly close the <p>. It’s almost as if you’d given this markup to the parser instead:

<p><b></p><div>text</div></b></p>

Notice that you didn’t even have to overlap your tags to end up in an overlapping tags scenario (the <b> element, in this case). This is just one edge case -- as you explore more scenarios, you’ll find that they can get pretty complex.

If you open up the IE8 Developer Tools to inspect the markup above, you’ll see this structure:

p
|->b
|--->div
|----->"text"
div
|->"text"
p

It seems reasonable enough, but there’s actually more going on beneath the surface. In previous versions of IE, we persist the overlapped markup more or less as written – meaning an overlapped element could occupy more than one position in the DOM tree.

This state, called an inclusion, occasionally leads to behavior differences across browsers, especially when using script to walk the tree. For example, reading nextSibling on the <b> element above will return the second <p> and reading firstChild will return null. This occurs in spite of the fact that the <b> element appears to be a parent of <div> and to have no siblings.

We improved IE9 mode to resolve such situations at parse-time to avoid these side-effects. In any place where earlier versions of IE would create an inclusion, IE9 creates a clone of the element instead.

So the markup from the example above would exist in the IE9 DOM as:

p
|->b
b
|->div
|--->"text"
p

IE9 clones the <b> tag when it sees the implicit </p> end tag. Thus, the DOM contains two distinct <b> elements, matching Chrome and Safari in this case. The HTML5 algorithm (supported by FF4 Beta) differs in that it clones overlapped elements upon encountering the next text node – resulting in a slightly different DOM structure above.
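
When you aren’t sure how a particular browser resolved overlapping markup, it can help to print the tree the parser actually built. The following is a minimal sketch of such a helper. The name dumpTree is made up for this example, it relies only on nodeType, nodeName, nodeValue, firstChild, and nextSibling, and it mimics the arrow notation used in this post (longer arrows mean deeper nesting).

```javascript
var TEXT_NODE = 3; // value of Node.TEXT_NODE

// Hypothetical helper: returns an array of lines describing `node` and
// its descendants, one line per node, in document order.
function dumpTree(node, depth, lines) {
  depth = depth || 0;
  lines = lines || [];
  // Text nodes are shown as quoted, escaped strings so whitespace is visible.
  var label = node.nodeType === TEXT_NODE
    ? JSON.stringify(node.nodeValue)
    : node.nodeName.toLowerCase();
  // Depth 1 gives "|->", depth 2 gives "|--->", and so on.
  var arrow = depth === 0 ? "" : "|" + new Array(2 * depth).join("-") + ">";
  lines.push(arrow + label);
  for (var child = node.firstChild; child; child = child.nextSibling) {
    dumpTree(child, depth + 1, lines);
  }
  return lines;
}

// Possible usage in a page:
//   alert(dumpTree(document.body).join("\n"));
```

Running this against the same markup in two browsers and comparing the output is a quick way to see where their trees differ.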

To avoid these problems in the first place, it’s a good idea to run your markup through the W3C’s online validator, which helps spot them before they become real bugs. For convenience, IE’s F12 Developer Tools have a built-in link to pass a site through the W3C’s validator.

Title Element

In IE8 and earlier versions of IE, the parser implicitly creates a <title> element whenever it encounters a <head>. As a result, developers targeting IE8 can assume that head.firstChild returns a <title> element, even if the markup doesn’t explicitly declare one.

In IE9, we made an interoperability change to respect the <title> element’s position in the <head>, like other major browsers.

Much like whitespace handling, this could result in your site behaving differently in IE9 than in previous versions of IE if you write applications that depend on the first child of your <head> element always being <title>.

If you need to grab the title, a better approach would be getElementsByTagName.
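
A minimal sketch of that approach follows. The helper name getTitleElement is invented for this example, and it returns null when the document has no <title> at all, which can now happen since the parser no longer creates one implicitly.

```javascript
// Hypothetical helper: finds the <title> element wherever it sits,
// instead of assuming it is the first child of <head>.
function getTitleElement(doc) {
  var titles = doc.getElementsByTagName("title");
  return titles.length > 0 ? titles[0] : null;
}

// Possible usage in a page:
//   var title = getTitleElement(document);
//   if (title) { /* work with the element */ }
```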

Object Element

Historically, the <object> element’s behavior in IE has been rather idiosyncratic, largely due to the fact that web sites often use it to interface with native code running outside the browser sandbox. In IE9, we’ve improved <object> parsing so it and its contents appear in the DOM like any other element.

For example, any <param> elements or fallback content inside the <object> will be persisted in the DOM, regardless of whether the <object> successfully loads.

This means that calls like the following will now work:

alert(document.getElementsByTagName('param')[0].nodeName);

You shouldn’t have to do anything special to take advantage of our new behavior – but you can now interact with <object> much like you can most other elements and like you can in other browsers.
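
As one example of what this makes possible, the sketch below reads the <param> children of an <object> into a plain name/value map. The helper name getParams is made up for this example, and it assumes each <param> carries the usual name and value attributes.

```javascript
// Hypothetical helper: collects the name/value pairs of every <param>
// inside the given <object> element.
function getParams(objectEl) {
  var params = objectEl.getElementsByTagName("param");
  var result = {};
  for (var i = 0; i < params.length; i++) {
    var name = params[i].getAttribute("name");
    if (name) { // skip <param> elements without a name
      result[name] = params[i].getAttribute("value");
    }
  }
  return result;
}

// Possible usage in a page:
//   var settings = getParams(document.getElementsByTagName("object")[0]);
```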

While these changes may seem less important than adding or changing an API, the impact on real web development is substantial. If you’re a developer, try your site in the latest Platform Preview and look for any problems resulting from the changes above.

As always, please send your feedback via Connect or the comments section.

Thanks!

Jonathan Seitel

Program Manager

Comments

Anonymous
September 13, 2010
Wait, those diagrams are ambiguous. Did you mean div |->span |--->”IE 9” or div |->span |->”IE 9” ?
Anonymous
September 13, 2010
Hello! does this mean that internet explorer 9 will have images/pictures showing up faster?
Anonymous
September 13, 2010
> The <img> element can’t have children, so the above doesn’t work in Chrome, Safari, or IE8, and has different behavior altogether in FF3.6. In IE9, Opera, and FF4 Beta, cases like this work as expected, and the text node is inserted properly.
Inserted where? In the <img> element where it can't go?
Anonymous
September 13, 2010
Reading the IEBlog is like watching Elaine dance on Seinfeld.
Anonymous
September 13, 2010
Why not just implement the HTML5 parsing algorithm? That's the best road to standards compliance and interoperability.
Anonymous
September 13, 2010
@Brianary: The text node is intended to be a child of the <span>. I apologize if my ASCII art was unclear -- I used whitespace to show nesting when I was first writing the post, but eventually switched to arrow length because I thought it looked better. @Dave: Yes, according to HTML5, the text node should be a child of <img>.
Anonymous
September 13, 2010
IE definitely needs some work around .innerHTML, glad your team is recognizing it. Does this mean web authors will finally be able to use .innerHTML on a <select> object or a <tr> object? Thanks
Anonymous
September 13, 2010
Is this going to be as fast as FireFox?
Anonymous
September 13, 2010
I second Mathieu Pellerin's question. Will innerHTML work with tables?
Anonymous
September 13, 2010
Huh, using "generic elements" just because it works is a terrible idea. If people start using that instead of class="", it won't be possible to add any new elements to HTML because they will clash with existing content.
Anonymous
September 13, 2010
> Why not just implement the HTML5 parsing algorithm? That's the best road to standards compliance and interoperability.
+1
Anonymous
September 13, 2010
“This allows you to semantically describe the content of your page” To cite Humpty Dumpty: “When I use a word it just means what I choose it to mean”. Alice, me, all browsers and all search engines won’t be able to understand your proprietary semantics. And worse, they might clash with standard semantics of the future. What a bad bad bad suggestion! Like nobody has ever thought about it. And as if adding a simple class wouldn’t solve every use case.
Anonymous
September 13, 2010
> Stilgar
> I second Mathieu Pellerin's question. Will innerHTML work with tables?
Well, they say they're investigating this: connect.microsoft.com/.../582525
Anonymous
September 14, 2010
@Sarah On the contrary, it allows for new semantic description standards to be developed. For example, perhaps a search engine wanted to allow developers to better semantically describe a page so that, say, the site's logo could be deciphered: <logo><img src="newlogo.png"></logo> Furthermore, it allows for backwards compatibility should a future extension to HTML5's standard semantic tags (header, footer, nav, etc.) be drafted in the W3C. Newer browsers (or other user agents, such as search engines) would recognize the new tag's semantics while older browsers ignored it.
Anonymous
September 14, 2010
With all this talk about "same markup" I really wonder why you didn't implement the whole HTML5 parser. Maybe you did this for Beta1/Preview5 (I hope, but doubt it), but as of now:
innerHTML still doesn't work on any select (Connect#571341) or table (Connect#582525) related elements.
innerHTML returns incorrect case of element and attribute names (Connect#584933, #584531).
name attributes on generic elements are ignored (Connect#557785).
Incorrectly nested elements create an unexpected DOM and rendering (Connect#582974).
Result: Web authors all over the world want you to implement the HTML5 parsing algorithm (Connect#584766). And so many small issues are simply closed as "won't fix". /sigh Still looking forward to the next preview though.
Anonymous
September 14, 2010
@badger: No you can’t. To develop a standard, well, you need to go through standardization. That’s not a one man thing. See Standardization in Wikipedia. You could participate at the WHATWG or W3C if you wanna do that. They are very successful. Your logo example can either work for just one search engine vendor (with damaging side effects perhaps) or for all. That’s the difference between proprietary extensions to a standard and evolving the same standard. To see the seriousness of this: IE9 will just do what IE should have done in the first place with unknown elements. IE<9 has hindered the standard development a lot by its parsing of unknown elements not according to how it is defined. Because of this the WHATWG has a hard time with introducing new elements. So it is good that it will now be easier to invent new elements, but nobody should do it on their own. That would be like one step forward, two steps back.
Anonymous
September 14, 2010
@Jonathan Seitel - As noted by many people here in the comments on this post, as well as on pretty much every post since it started, the fixes you noted are all worthy and appreciated, but the two places developers want the support most are setting innerHTML on select and table elements (and all children of tables). I'm quite shocked that you made a post about fixing innerHTML in IE, yet completely neglected to touch on the above bugs or discuss a timeline for when we can expect a fix.
Anonymous
September 14, 2010
I'm quite shocked that the trolls haven't gotten bored yet and found something else to do.
Anonymous
September 15, 2010
What about inheritance of old HTML4 styling attributes in tables? IE is apparently doing some hackery there, and some sites rely on that, making life harder for the other browser makers: bugzilla.mozilla.org/show_bug.cgi
Anonymous
September 15, 2010
Just downloaded IE9Beta, and confirmed <select>.innerHTML is still broken. It's killing me to witness IE9 shaping up to be a very interesting piece of technology while at the same time failing to fix decade old bugs...
