Word XHTML - Compliance and Styles

This is the second post in a series by Zeyad Rajabi, a program manager working on the XHTML output used in Word's new blogging feature.

My first post gave a brief introduction to our XHTML output for the blogging feature in Word 2007. This post outlines the styles we output.

Goals

In my last post I said we wanted to be XHTML compliant by the time we ship this blogging feature. Today I want to be a little clearer about what I mean: Strict vs. Transitional.

Working on Word, I have come to understand Word’s HTML and CSS capabilities. Word supports only a subset of the standard HTML 4.0 specification and, similarly, only a subset of the standard CSS 1.0 specification. Yes, you read that correctly: CSS 1.0. For the most part, the feature set we offer within the blogging tool lets us output CSS properties that Word supports and can render correctly. However, there are a few cases where we cannot output the CSS properties that XHTML Strict compliance would call for, because Word would not be able to read them back in. Word basically ignores any HTML and CSS it does not recognize, and one of our goals was that blog posts could still be edited in Word after they are published.

What does that mean for our output?

At a minimum, our goal is to always validate as XHTML 1.0 Transitional. A basic blog post will validate as XHTML 1.0 Strict. For posts that use features where we cannot output CSS that Word supports, our aim is XHTML 1.0 Transitional.

Word can certainly output any HTML or CSS, but the issue then is around roundtripping, which is the ability to generate HTML or CSS that can be read back in correctly. An obvious question would be to ask why Word can't just add the functionality to read those additional properties back in correctly. This would be great, but we are on a limited budget, and that would have meant taking away other features that we have prioritized higher. Because of this, there is a fine balancing act that we must perform: roundtripping vs. XHTML output.
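To make the roundtripping trade-off concrete, here is a minimal Python sketch of the idea, not Word's actual code. It assumes a hypothetical allow-list called ROUNDTRIP_SAFE, filled in with the property names from the style table in this post, and a hypothetical helper split_declarations that separates the declarations an editor could read back in from the ones it would silently ignore.

```python
# Hypothetical allow-list: property names copied from the style table in this post.
# Treating it as "everything the editor can read back in" is an assumption.
ROUNDTRIP_SAFE = {
    "color", "font-family", "font-size", "text-decoration",
    "text-align", "text-indent", "background-color", "margin-left",
    "padding-top", "padding-left", "padding-bottom", "padding-right",
    "border-collapse", "border-top", "border-left", "border-bottom",
    "border-right", "width",
}


def split_declarations(style):
    """Split an inline style string into (kept, dropped) declaration lists.

    Simplification: splitting on ';' is fine for the simple values used
    here but would break on values that themselves contain a semicolon.
    """
    kept, dropped = [], []
    for decl in style.split(";"):
        if ":" not in decl:
            continue  # skip empty or malformed pieces
        name, value = decl.split(":", 1)
        name, value = name.strip().lower(), value.strip()
        target = kept if name in ROUNDTRIP_SAFE else dropped
        target.append(f"{name}: {value}")
    return kept, dropped


kept, dropped = split_declarations("color: red; float: left; font-size: 12pt")
print(kept)     # ['color: red', 'font-size: 12pt']
print(dropped)  # ['float: left']
```

Anything that lands in the dropped list is what would be lost the next time the post is opened for editing, which is exactly the balancing act described above.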

XHTML Style Output

Feature          XHTML CSS Property              HTML Element

Font             color                           span
                 font-family                     span
                 font-size                       span
                 text-decoration:line-through    span
                 text-decoration:underline*      span

Block            text-align*                     p
                 text-indent*                    p

Background       background-color                span

Box              margin-left*                    p

Table Padding    padding-top                     td
                 padding-left                    td
                 padding-bottom                  td
                 padding-right                   td

Table Borders    border-collapse:collapse        table
                 border-top                      td
                 border-left                     td
                 border-bottom                   td
                 border-right                    td

Position         width                           col

CSS properties marked with * are ones we will start outputting after Beta 2.
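One way to read the table above is as a lookup from CSS property to the element that carries it. The Python sketch below is only an illustration under that reading: the CARRIER dictionary is transcribed from the table, while the render helper and the choice of an inline style attribute are assumptions, not a description of how Word actually writes its markup.

```python
from html import escape

# Property -> carrying element, transcribed from the table above.
CARRIER = {
    "color": "span", "font-family": "span", "font-size": "span",
    "text-decoration": "span", "background-color": "span",
    "text-align": "p", "text-indent": "p", "margin-left": "p",
    "padding-top": "td", "padding-left": "td",
    "padding-bottom": "td", "padding-right": "td",
    "border-collapse": "table",
    "border-top": "td", "border-left": "td",
    "border-bottom": "td", "border-right": "td",
    "width": "col",
}


def render(tag, text, declarations):
    """Hypothetical helper: emit one element with an inline style attribute.

    Raises ValueError if a property is not listed for that element.
    """
    for prop in declarations:
        if CARRIER.get(prop) != tag:
            raise ValueError(f"{prop} is not listed for <{tag}>")
    style = "; ".join(f"{prop}: {value}" for prop, value in declarations.items())
    return f'<{tag} style="{escape(style)}">{escape(text)}</{tag}>'


print(render("span", "Hello", {"color": "red", "font-size": "12pt"}))
# <span style="color: red; font-size: 12pt">Hello</span>
```

Asking for a property on the wrong element, for example render("p", "Hello", {"color": "red"}), raises ValueError in this sketch, which mirrors the table: font properties go on span, block properties on p.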

An interesting property that is missing is float. Unfortunately, Word does not understand that CSS property. Instead, we will use the HTML attribute align, which makes posts with that type of content XHTML Transitional rather than Strict. We could output float, but if the post were ever read back into Word that property would be ignored, so the image would no longer float.

Another interesting thing to point out is the styles we output for tables. As you can tell, our HTML output for tables in Beta 2 is quite bloated. The style table above will certainly be updated as we get closer to release. I will post the complete spec of Word's XHTML support at a later date.
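As a rough illustration of how one aligned image changes the compliance level of a post, the Python sketch below scans a fragment for an img element carrying an align attribute, which XHTML 1.0 Transitional allows and XHTML 1.0 Strict does not. The names AlignFinder and target_dtd are hypothetical, and this is a single check, not a validator.

```python
from html.parser import HTMLParser


class AlignFinder(HTMLParser):
    """Records whether any <img> in the fragment has an align attribute."""

    def __init__(self):
        super().__init__()
        self.uses_align = False

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also reach this method,
        # because the default handle_startendtag delegates to it.
        if tag == "img" and any(name == "align" for name, _ in attrs):
            self.uses_align = True


def target_dtd(fragment):
    """Hypothetical helper: name the DTD a fragment could aim for."""
    finder = AlignFinder()
    finder.feed(fragment)
    finder.close()
    return "XHTML 1.0 Transitional" if finder.uses_align else "XHTML 1.0 Strict"


print(target_dtd('<p><img src="a.png" alt="" align="left" /></p>'))
# XHTML 1.0 Transitional
print(target_dtd('<p><img src="a.png" alt="" /></p>'))
# XHTML 1.0 Strict
```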

Suggestions are welcome

Anything missing?

Comments

  • Anonymous
    May 30, 2006
    This is not exactly related to your post, but I do have a question about the blogging feature.

    Is there a way to get at the XHTML it generates without actually posting to a blog? If I save a blog document as HTML, it just saves it using the old full-fidelity (ugly) HTML.

    I'm asking because the blogging software I use (b2evolution) doesn't seem to be supported by beta 2 (or at least, I couldn't get it to work) but I'd still like to just be able to cut'n'paste the XHTML it generates into the online form for posting. At least that way I would be able to do all the easy formatting in Word...
  • Anonymous
    May 31, 2006
    The XHTML output is specific to the blogging feature. The reason we decided not to offer a Save As option for the XHTML output is because the level of fidelity does not match all of Word’s feature richness. This blogging feature is meant to maintain the fidelity of a small percentage of the total feature set available from Word.  

    Zeyad
  • Anonymous
    May 31, 2006
    I understand your point.

    Joe Friend mentioned in his post that it was extensible so people could write extensions for their own blogging engines. Perhaps someone could write a plugin which writes it to file? Is that extensibility in beta2 or will we have to wait for it?

    I think it'd be good for general-purpose HTML publishing as well, since most HTML editors aren't exactly great word processors...
  • Anonymous
    May 31, 2006
    I understand that it is only blog output that is going to be XHTML compliant, due to the difficulty of achieving full fidelity in an XHTML rendition of a Word document.

    If I need to produce XHTML from a Word document, which is already a very common need and will increasingly be so, what is the recommended approach?

    In previous versions I have taken the Save as filtered HTML output and run it through Tidy.  I am presuming that using the new open XML file format and then using XSLT to transform it is now the preferred approach.  If so, is Microsoft doing work in this area (I know there was a WordML to HTML transform) and if not where's the best place to look?  Seems inefficient for hundreds of developers who need this to go and reinvent the wheel...!

    Thanks!
  • Anonymous
    June 01, 2006
    We are looking into different ways of making this available, such as via the OM. For the moment, Marcusf, the approach you mention is the only way.
  • Anonymous
    June 06, 2006
    I love the blogging feature.  I routinely have to edit in another editor then post to my blogging backend.  However, the one problem I have with Word's blogging tool is that I can't seem to turn off the "fancy quotes" feature. (I'm sure there's a real name for this feature but I don't know it.) So when I post to my blog, I get odd characters instead of apostrophes.

    Am I missing something easy, or could better control of characters that are allowed be added?
  • Anonymous
    June 06, 2006
    The comment has been removed
  • Anonymous
    June 06, 2006
    One of the features in Word 2007 Beta 2 is the ability to author blog posts. Joe Friend announced the...
  • Anonymous
    June 07, 2006
    The comment has been removed
  • Anonymous
    June 07, 2006
    "Smart Quotes" I knew there was a term for it... Thank you very much.

    Hmm.  Okay, I found it, but it's global.  So I'm not completely happy with that.  So, I looked into plugins for my blogging tool (Movable Type which I know isn't supported, but was still easy to set up Word to use) and found a plugin named "Naughty Word Chars." It replaces special characters with a normal equivalent. I prefer this as I like "Smart Quotes" but in my documents, not my blog posts.

    I would have to say my problem is solved, and I don't hold it against Word.  I switched to only using Word to post now though I still have to massage a bit with the Movable Type admin.

    So if it isn't clear, I still love this feature of Word.
  • Anonymous
    June 08, 2006
    This is the third post by Zeyad Rajabi who owns the XHTML output from Word's new blogging feature. In...
  • Anonymous
    July 12, 2006
    This is the fourth post by Zeyad Rajabi who owns the XHTML output from Word's new blogging feature. In...
  • Anonymous
    August 14, 2006
    This is the fifth post by Zeyad Rajabi who owns the XHTML output from Word's new blogging feature. In...