XHTML in Word 2007's blogging tool

Today we have a guest writer to discuss the HTML output that we have in the new blogging functionality for Word 2007. His name is Zeyad Rajabi and he's a program manager on the Word team. Zeyad works on file format related issues, including the HTML support in Word. All of Zeyad's posts will be under the "Word HTML" category if you are interested in tracking those separately.

As some of you may know from Joe Friend’s blog, Word 2007 will allow users to author blogs straight from Word. I want to follow up on Joe’s blog by giving you guys more details concerning our XHTML output for the blogging feature. I hope to use this blog as an opportunity for you to comment on our blogging XHTML output and to make any suggestions.

Goals

Before I get into details about our XHTML output, I want to outline the goals for our blogging feature. The design goals behind the XHTML output from the blog tool are significantly different from what we’ve done in the past:

  • Output XHTML compliant code for each post (we are following the W3C spec)
  • Output clean and readable XHTML

Instead of concentrating on supporting 100% of Word’s features (as we did in the past) the blog feature will support a much smaller set of features and additionally concentrate on outputting clean and readable XHTML. The blog feature will only output the necessary XHTML needed to represent the document. No more redundant HTML or CSS. No more Microsoft Office specific CSS properties. We will output just clean and easy to read XHTML.

Known Beta 2 Issues

There are still some known bugs in the XHTML output for Beta 2. I wanted to point them out so that you aren’t surprised:

  • Strikethrough - We are outputting the CSS property text-decoration for strikethrough instead of <del>
  • Divs around lists - We are outputting div tags for every list item. We do not need to output these extra elements
  • Block level elements within inline elements - We are not XHTML compliant in some cases because we are not following proper tag content flow. We are outputting block-level elements inside inline elements.
  • Multi-level lists - Our multi-level list output is not XHTML compliant: we are closing the lists before their sub-lists are closed
  • Table bloat - Our XHTML output for tables is too heavy and contains too much redundancy
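
Two of the issues above (list nesting and block-level elements inside inline elements) can be checked mechanically. Here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function names, the INLINE and BLOCK sets, and the sample fragments are assumptions made for illustration; in particular, bad_list is only one way a mis-nested list could look, not the actual Beta 2 output. Note that parsing as XML only shows a fragment is well-formed, which is necessary for XHTML compliance but not sufficient.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated element sets for this sketch;
# the XHTML DTDs define the complete lists.
INLINE = {"span", "a", "em", "strong", "u", "del"}
BLOCK = {"p", "div", "ul", "ol", "blockquote", "table"}


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


def block_inside_inline(fragment: str) -> list:
    """Return (inline_ancestor, block_descendant) tag pairs found in the fragment."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    hits = []
    for parent in root.iter():
        if parent.tag in INLINE:
            # iter() also yields the element itself, so skip it.
            for child in parent.iter():
                if child is not parent and child.tag in BLOCK:
                    hits.append((parent.tag, child.tag))
    return hits


# The sub-list sits inside its parent item and is closed before that item closes.
good_list = "<ul><li>one<ul><li>one-a</li></ul></li><li>two</li></ul>"

# Assumed example of mis-nesting: the outer list is closed while an item is still open.
bad_list = "<ul><li>one<ul><li>one-a</li></ul></ul></li>"

if __name__ == "__main__":
    print(is_well_formed(good_list))
    print(is_well_formed(bad_list))
    print(block_inside_inline('<span style="color:red"><p>text</p></span>'))
```

A check like this could run over each generated post before publishing; a real validator would compare against the DTD rather than two hand-written sets.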

I am sure there are more bugs to be found and I’m sure you guys will help me add to the list! As you play with the blogging feature, please feel free to send me any questions or suggestions you have. I want to make this feature great for all of us.

XHTML Output

There is too much to discuss in this first post, so I think I’ll break down the XHTML output into multiple categories: formatting, styles, lists, images and tables. I’ll have a separate post for each category so we can have some more targeted discussions. Another thing that I was thinking about doing was pulling all of this together as a public spec that I can post. Again, I would love for you to send me suggestions on any or all of the categories.

For those interested take a look at the source code the blogging tool generated for this post (note that it's only the contents that would go inside the ).

Formatting

Today let’s look at some details around the XHTML we output for formatting features:

Feature          XHTML

Hyperlinks       <a href="https://www.foo.com" target="_blank" title="Tip">hyperlink</a>
Font             <span style="font-family:XXXXXX;">text</span>
Font Size        <span style="font-size:28pt">text</span>
Font Color       <span style="color:XXXXXX">Colored text </span>
Bold             <strong>text</strong>
Italic           <em>text</em>
Underline        <u>text</u>
Strikethrough    <del>text</del>
Highlighter      <span style="background-color:XXXXXX">text</span>
Alignment        <p align="left">text</p>
                 <p align="right">text</p>
                 <p align="center">text</p>
Indent           <blockquote>text</blockquote>
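
To make the mapping concrete, here is a small Python sketch that treats the feature table as a lookup of templates. This is not how Word produces its output; the TEMPLATES dictionary, the render function, and the {value} placeholder (standing in for the XXXXXX entries) are hypothetical names chosen for the example.

```python
from html import escape

# Templates taken from the feature table; {value} replaces the XXXXXX
# placeholders and {text} is the formatted run. Names are hypothetical.
TEMPLATES = {
    "Font": '<span style="font-family:{value};">{text}</span>',
    "Font Size": '<span style="font-size:{value}">{text}</span>',
    "Font Color": '<span style="color:{value}">{text}</span>',
    "Bold": "<strong>{text}</strong>",
    "Italic": "<em>{text}</em>",
    "Underline": "<u>{text}</u>",
    "Strikethrough": "<del>{text}</del>",
    "Highlighter": '<span style="background-color:{value}">{text}</span>',
    "Alignment": '<p align="{value}">{text}</p>',
    "Indent": "<blockquote>{text}</blockquote>",
}


def render(feature: str, text: str, value: str = "") -> str:
    """Wrap text in the markup listed for the given feature."""
    template = TEMPLATES[feature]
    # Escape both pieces so that <, >, & and quotes cannot break the markup.
    return template.format(text=escape(text), value=escape(value, quote=True))


if __name__ == "__main__":
    print(render("Italic", "hello"))  # prints <em>hello</em>
    print(render("Font Size", "text", "28pt"))
```

Keeping the mapping in one table like this makes it easy to swap a single entry, for example trying a margin-based style for Indent, without touching anything else.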

Suggestions are Welcome

I know there are a couple different approaches for all of these. If you disagree with our approach let me know. I’ve read a lot of differing opinions on some of these (especially indentation), so while we probably won’t get everyone to agree 100% on the approach, hopefully we can find the best approach.

Anything missing? Is there a better way of representing a feature in XHTML?

Comments

  • Anonymous
    May 21, 2006
    Looks like the generated XHTML tag names are in upper case; shouldn't they be lowercase?

  • Anonymous
    May 22, 2006

    The obvious question, what about fonts?

    How are Office 2007-specific fonts translated so they render on the web on anyone's machine? Are they? Jensen Harris explained that one of those fonts is chosen by default in any of the Office 2007 applications, so that makes the issue even more blatant.

  • Anonymous
    May 22, 2006
    bad bad bad bad bad....

    Ok, so not as bad as it used to be, and this is a huge improvement, but gratuitous use of style tags is nearly as bad as legacy HTML font and various other presentation tags.

    The whole point is to remove presentation from the HTML, not simply change the way you embed it in the HTML.  

    Also, while align and target are valid in XHTML 1.0 Transitional, they're deprecated, and not valid in 1.0 strict or 1.1.

    Why not generate a style sheet?

  • Anonymous
    May 22, 2006
    Why are you using semantic elements like "strong" and "em"?  Do you really have any indication that those are the semantics actually intended, or are you just imputing that from the appearance of the text?  If you don't have real semantic information, then shouldn't you be using CSS to supply these, or just plain "b" and "i"?  It might be appropriate to use "del" for text that Word has marked as deleted, but why use it for text that is simply styled with a strikethrough?  And why are you using the deprecated "u"?  

  • Anonymous
    May 22, 2006
    I agree with mystere.  I don't get the impression you're really trying at all here.  Put some effort into it man.

  • Anonymous
    May 22, 2006
    You will definitely need an options panel to allow users to control how much attribute-based formatting they want vs. CSS-based formatting. One case I address with my Word 2003 tool, CleanXHTML, is to turn align="left" off for CSS friendliness.

    Denying "power users" to choose these levels will be another, "classic" Microsoft move that should be avoided at all costs.

  • Anonymous
    May 22, 2006
    I posted a comment to a previous post from Joe, but I'll repeat the link here.


    My suggestion is to use styles (or at least give the option) to drive the HTML formatting. Zeyad says you have some issues with lists and indenting. Of course you do - trying to largely flat map word processing formatting to a nested format.

    With a good set of styles then you can map from the word processor to XHTML much more reliably. Why not ship Word with a better set of styles than what comes in the Normal template by default?

    This post on my site has some pointers to more information about the way we do it, in a system that works with both Word and OpenDocument:

    http://ptsefton.com/blog/2006/05/13/beyond_blogging:_style-driven__html_export_from_2007._please.

  • Anonymous
    May 22, 2006
    "Denying "power users" to choose these levels will be another, "classic" Microsoft move that should be avoided at all costs."

    They're already doing it with the whole Ribbon UI.

    - Jon Peltier

  • Anonymous
    May 22, 2006
    Wow, lots of MS-haters here! Personally, I can't wait for this feature. I use Word 2003 to type out my blogs at the moment (mainly because of the spell checking/autocorrect) and I end up putting in the (X)HTML once I've cut'n'pasted my text from Word into my browser. Actually, the only reason I do it like that is because while I like the smart quotes in my text, I can't have them in my <a href=""> tags, so I go in and add them all once I've typed the text out.

    At least now (well, soon) I will just be able to do it all from Word!

  • Anonymous
    May 22, 2006
    Mike: the other thing about the upper-case tags is that the blogging software on blogs.msdn.com is Community Server, which uses FreeTextBox for its WYSIWYG editor. So if the author posted from Word, then opens the post up in the Community Server editor, FreeTextBox munges all the tags so that they're upper case. It's pretty annoying actually, and it produces some horrible HTML, but that's what you get with web-based WYSIWYG editors, I guess.

    Zeyad: I think what people are recommending in terms of separate CSS is an actual separate file which we could install on the server separately, then instead of <span style=""> tags, we'd get <span class=""> tags or something. Myself, I don't think that makes sense. After all, that's not really how Word works - and as good as it is that we can get this clean XHTML out of Word, I don't think trying to force it to a whole new paradigm just for blogging is a smart thing to do.

  • Anonymous
    May 22, 2006
    zeyad, it depends on what you're trying to indent, but usually a:

      <p style="padding-left:10px;"></p>

    or similar, should do the trick

    blockquote should only be used when quoting someone

  • Anonymous
    May 23, 2006
    Word supports a subset of the standard HTML 4.01 specification and similarly a subset of the standard CSS 1.0 specification. Unfortunately there are certain CSS properties that are not understood when applied to certain elements. Padding CSS properties cannot be applied to span, div, or p elements.

    We can use margin CSS properties on div or p elements. Any objections to using margin CSS properties for indentation?

    For anyone who is interested I am currently working on publishing Word’s HTML and CSS specification on MSDN. As soon as it is ready I will let everyone know.

    Zeyad

  • Anonymous
    May 23, 2006
    using margin instead of padding should be fine, it'll probably give you better compatibility with IE anyhow.

    out of interest, if Word only supports a subset of HTML 4, why not fix Word so that it supports XHTML 1.0 in the first place?

  • Anonymous
    May 23, 2006
    How can you be XHTML compliant when you are using the <U> tag? That is so outdated.

    Also the CSS needs to be lowercase, that's how everyone likes it!

  • Anonymous
    May 24, 2006
    As I promised, we are posting details of the HTML output spec for the blog feature and are interested...

  • Anonymous
    May 24, 2006
    <ins> and <del> are deprecated in the working draft of the XHTML 2.0 standard.

    There's others, too.

  • Anonymous
    May 25, 2006
    Actually k, the property for text indentation is literally text-indent. ;)

  • Anonymous
    May 25, 2006
    I think the first thing that ought to be addressed is WHAT XHTML version is the blogging tool going to support. Saying that you support XHTML is vague at best. Sound choices probably lie in XHTML Basic and XHTML 1.0 Strict. Once this is established you have a baseline for what tags and attributes can or should be supported.

    Secondly, there seems to be an unclear distinction between presentation and semantics. Bold (<b>) doesn't equal strong emphasis (<strong>), italicize (<i>) doesn't equal emphasis (<em>), strikethrough doesn't equal deleted (<del>).

    It’s my firm opinion if you can't determine the meaning behind the user's style choices then you can't blatantly add semantics to the output document. Outputting <b> and <i> (or style equivalents) for bold and italicize respectively is safe and ultimately the right thing to do without additional information of the user’s intent.

    Strikethrough should be left as is.

  • Anonymous
    May 25, 2006
    Tom I stand corrected.

  • Anonymous
    May 26, 2006
    Our aim is to be as close as possible to XHTML strict 1.1 by the time we ship this feature. That being said, as I mentioned in a previous comment, Word has some limitations as to what CSS is supported. I will touch more upon this limitation when I start talking about images and in particular floating images.

    To be safe I am going to say we are aiming to be XHTML Transitional compliant, depending on the content of the blog. There are a lot of balancing decisions being made in terms of fidelity vs. being XHTML compliant. I hope to use these posts as a means of ironing out those kinks.

    Jonathan you bring up an interesting topic: semantic vs. presentation. We decided to go down the route of presentation for bold and italic rather than output both bold/italic and strong/emphasis. What would be the main advantage of outputting bold instead of strong? I am trying to think of this in terms of all the different types of users who will use this feature; from the common user to the expert user.

  • Anonymous
    May 26, 2006
    I was hoping this feature (clean XHTML) would be available as a standard save option but, from a quick look at the Word Beta, it seems it is only available on the blog publish.

    Is there no way to get the XHTML code without using the blog publish?

    If so, why not make this an available save option?

    Surely getting the world's most widely used document format into clean HTML and onto the web should be easier?

  • Anonymous
    May 29, 2006
    I agree 100% with Mr. K, when he says, "Is there no way to get the XHTML code without using the blog publish?"

    "...why not make this an available save option?"

    Can someone please answer these questions?

    My wife and I work on maintaining several Web sites, none of which would qualify as a 'blog'. Why is MS suddenly abandoning us in favor of bloggers? I could understand if you had already incorporated an .html export system in Word. But, in Word 2007, apparently you've removed what little support there was.

    Blogs are used generally by .html dummies and I can understand your wanting to score some points with them, but what about the rest of us who know something about .html and would like to use MS Word to help us create Web pages?

  • Anonymous
    May 30, 2006
    Currently the plan is to only make the XHTML output available via the blogging feature. That being said we are looking into different ways of making this available; like via the OM. The reason we decided not to offer a Save As option for the XHTML output is because the level of fidelity does not match all of Word’s feature richness. This blogging feature is meant to maintain the fidelity of a small percentage of the total feature set available from Word.  

    Zeyad

  • Anonymous
    May 30, 2006
    The XHTML output is specific to the blogging feature. The reason we decided not to offer a Save As option for the XHTML output is because the level of fidelity does not match all of Word’s feature richness. This blogging feature is meant to maintain the fidelity of a small percentage of the total feature set available from Word.  

    Zeyad

  • Anonymous
    June 02, 2006
    Anyone who follows the W3C's XHTML development and that of the web should be aware that presentational markup is being removed out of the body of the text in favour of CSS whether it is a blog or webpage.  

    Word with its templates and autoformat functionality is well placed to deliver this presentational separation. A simple word doc could be saved as XHTML without presentational markup but then that might diminish the role of the .DOC format!

    Jonathon P makes some very good points. Pick your flavour of XHTML: transitional (very safe but pointless), strict 1.0 (expected and a good starting point) or even, may I be so bold as to emphasise, XHTML 2.0 (daring and leading edge but unlikely), which is now a W3C standard or will be by the time Word 2007 ships.

    Equally what about the support for mobile word in the Mobile 5.0 OS.  Will XHTML-basic/MP be supported?

    Finally why is there a reference to Yahoo in the source code?
    http://geo.yahoo.com/serv?s=76001405&t=1149288117  

  • Anonymous
    June 02, 2006
    "The reason we decided not to offer a Save As option for the XHTML output is because the level of fidelity does not match all of Word’s feature richness."

    Then why offer the ability to save a Word document as a text file?

    At least if a save as XHTML feature could be accessed through some obscure VBA code then that would be good enough.

    It is not a case of using Word to design webpages but simply making it easier to get the contents of Word documents onto webpages.

    It doesn't have to be a standard or obvious feature but at least if it could be made available without using the blog publish then I'm sure it would prove useful for many people who get Word docs and then have to get them onto the web without all of the Word HTML tags.

  • Anonymous
    June 06, 2006
    One of the features in Word 2007 Beta 2 is the ability to author blog posts. Joe Friend announced the...

  • Anonymous
    June 08, 2006
    This is the third post by Zeyad Rajabi who owns the XHTML output from Word's new blogging feature. In...

  • Anonymous
    June 12, 2006
    We are looking into ways of providing our XHTML output outside the scope of just blogging. One such mechanism is through the OM.

    Mr. K, saving as plain text differs from our XHTML output in its intent. Our XHTML output is only a small subset of the total XHTML specification. Only a percentage of Word’s features can be represented by our limited XHTML output. As we continue to build on top of this blogging feature, and specifically our XHTML output, we will support more and more of Word’s features. As is, we do not want to provide our XHTML output as a Save As option because of the feature degradation. Plain text differs in that our output preserves as much fidelity as possible with respect to plain text. When we get to that level for our XHTML output I do not see any reasons why we wouldn’t have it on the Save As menu.

    Zeyad

  • Anonymous
    July 12, 2006
    This is the fourth post by Zeyad Rajabi who owns the XHTML output from Word's new blogging feature. In...

  • Anonymous
    August 14, 2006
    This is the fifth post by Zeyad Rajabi who owns the XHTML output from Word's new blogging feature. In...
