Word XHTML - Mapping styles to semantics

This is the third post by Zeyad Rajabi, who owns the XHTML output from Word's new blogging feature. In earlier posts, Zeyad gave a general overview of the XHTML and then a more detailed look at XHTML compliance. Today Zeyad discusses the ways in which styles have been tied directly to specific XHTML tags.

Today I wanted to talk a bit about the template that we use for the Word 2007 blogging feature. Word has always concentrated on the presentation of documents and making it easy for people to quickly create a great looking document. The area that we haven't focused on as much though is allowing people to better specify the semantic meaning of their content. We've been slowly moving in that direction with the custom XML support in Word 2003 and the content controls and XML mapping in Word 2007. We actually leverage a content control to allow you to specify the blog's title.

One of the oldest ways folks have specified semantic meaning in a Word document, though, is by using styles, and we've done work in Word 2007 to make styles much more convenient for the average end user. We've created a number of Word styles that we map directly to XHTML tags with semantic meaning (like <strong>, <em>, and <blockquote>). We then let the browser and blog sites determine how to render these tags (based on stylesheets, etc.).

In Word 2007 one of our investments was giving our users easy access to applying styles via "quick styles". In our blog template we provide a list of styles that can be applied to the contents of the document as you can see in the screen shot below:

[Screenshot: Styles]

These styles are all significant in that we can map them directly to XHTML tags (rather than simply to formatting properties). Below is a table listing all the styles provided by our blog template and their XHTML equivalent.

Style       HTML
Heading     h1, h2, h3, h4, h5, h6
Normal      p
Quote       blockquote
Code        <pre><code>… </code></pre>*
Strong      strong
Emphasis    em

*The code style is being added post Beta 2.
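
To make the table concrete, here is a minimal Python sketch of a style-to-tag lookup. It is only an illustration, not how Word implements the mapping: the names STYLE_TO_TAGS, tags_for_style and render are hypothetical, and passing the heading level as a separate argument is an assumption.

```python
from html import escape

# Hypothetical lookup table mirroring the style table above.
# Each tuple lists the elements to open, outermost first.
STYLE_TO_TAGS = {
    "Normal": ("p",),
    "Quote": ("blockquote",),
    "Code": ("pre", "code"),
    "Strong": ("strong",),
    "Emphasis": ("em",),
}

def tags_for_style(style, heading_level=1):
    """Return the XHTML elements for a template style name."""
    if style == "Heading":
        if not 1 <= heading_level <= 6:
            raise ValueError("XHTML only defines h1 through h6")
        return (f"h{heading_level}",)
    return STYLE_TO_TAGS[style]

def render(style, text, heading_level=1):
    """Wrap escaped text in the elements mapped to the style."""
    tags = tags_for_style(style, heading_level)
    opening = "".join(f"<{t}>" for t in tags)
    closing = "".join(f"</{t}>" for t in reversed(tags))
    return opening + escape(text, quote=False) + closing

print(render("Heading", "Mapping styles", heading_level=2))
print(render("Quote", "Styles carry meaning."))
```

No formatting properties appear anywhere in the output; the site's CSS decides how each element looks.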

What do you guys think of having a style called "Code" that actually nests pre and code together? Word differs from the web in that we automatically preserve whitespace, so in order for us to correctly output XHTML for the code style we also need to output the preformatted style.
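
Here is a rough Python sketch of why the nesting matters. Browsers collapse runs of whitespace inside ordinary elements, so <code> on its own would lose indentation and line breaks, while <pre> keeps them. The function name render_code_block is hypothetical and this is not Word's actual implementation.

```python
from html import escape

def render_code_block(source):
    """Emit a Code-styled block as <pre><code>, keeping whitespace as typed."""
    # Only markup-significant characters (&, <, >) are escaped; spaces, tabs
    # and newlines pass through untouched because <pre> preserves them.
    return "<pre><code>" + escape(source, quote=False) + "</code></pre>"

snippet = "if (a < b) {\n    swap(a, b);\n}"
print(render_code_block(snippet))
```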

One interesting discussion that came up in some of my previous posts was whether it was better to use <b> and <i> or <strong> and <em> when people applied bold or italic formatting to their text. Having the <b> and <i> tags in the HTML makes it likely that the look of the document will not change regardless of style sheet (a style sheet is more likely to restyle <strong> than <b>). On the other hand, <strong> and <em> provide much more flexibility in that they really only imply semantics and not display values. While <strong> and <em> have a default presentation, it is often overridden by the CSS of the page or the rendering engine.

Some people were saying that there are occasions where <b> and <i> better capture what the user intended. While I do agree that may be the case at times, I also believe our UI encourages the use of bold and italic when the user was often just trying to convey the semantics. In most cases, I believe they would have specified strong or emphasis if it were as easy and obvious as bold and italic (there just hasn't been a benefit to doing so in the past). Since we are going the route of XHTML compliance and we are concentrating on structure rather than presentation, we opted to always output strong/em rather than a more confusing mixture of the two (bold/italic and strong/em).
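
As a small illustration of the "always strong/em" decision, here is a Python sketch that renders runs of text carrying bold and italic flags. The function render_run and the (text, bold, italic) run shape are hypothetical, and putting <strong> outside <em> when both flags are set is an arbitrary choice for the sketch.

```python
from html import escape

def render_run(text, bold=False, italic=False):
    """Render one run of text, always using strong/em for bold/italic."""
    html = escape(text, quote=False)
    if italic:
        html = f"<em>{html}</em>"
    if bold:
        html = f"<strong>{html}</strong>"
    return html

# Each run is (text, bold, italic); the shape is assumed for illustration.
runs = [
    ("This is ", False, False),
    ("really", True, False),
    (" important", False, True),
]
print("<p>" + "".join(render_run(t, b, i) for t, b, i in runs) + "</p>")
```

Because <b> and <i> are never emitted, the output stays consistent whether the user clicked the bold button or applied the Strong style.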

Custom Styles

One area that we are looking at investing in is giving folks the ability to add custom styles to the blogging template that would then be output as a simple class name. So unlike the above examples where the style is mapped to a specific XHTML tag, we would simply output a <p> or <span> whose class name matches the style. So, if a user adds the style "foo" to their blogging template, then when that style is applied we would output:

<p class="foo">…..</p>

We would not output the formatting information for the style because in most cases the CSS would be stripped upon publishing to a blog provider. Instead, with this approach, you could rely on the CSS of the host site of the blog to specify the presentation information for those custom styles.
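
Here is a minimal Python sketch of that fallback, assuming the custom style name is already usable as a single class name (how names containing spaces would be handled is not covered here). BUILT_IN and render_paragraph are hypothetical names, not part of Word.

```python
from html import escape

# Styles the template maps to dedicated elements (from the table earlier
# in this post); anything else is treated as a custom style.
BUILT_IN = {"Normal": "p", "Quote": "blockquote"}

def render_paragraph(style, text):
    """Use the mapped element for built-in styles, else <p class="style">."""
    body = escape(text, quote=False)
    if style in BUILT_IN:
        tag = BUILT_IN[style]
        return f"<{tag}>{body}</{tag}>"
    # Custom style: only the name is emitted, no formatting properties.
    return f'<p class="{escape(style, quote=True)}">{body}</p>'

print(render_paragraph("foo", "Hello"))
print(render_paragraph("Quote", "Hello"))
```

The host site could then style the paragraph with a rule such as p.foo { ... } in its own stylesheet.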

Comments are welcome

Any comments or questions are welcome. Also let me know if there are any other similar structures you guys are interested in talking about next (ordered and unordered lists, definition lists)?

Comments

  • Anonymous
    June 08, 2006
    If you have a "strong" or "emphasis" style, then you do clearly have indication of semantics, so <strong> or <em> would be appropriate.  However, there are many, many ways in which bold and italic styles are used even in English writing practice, let alone other traditions--just note the number of other semantic tags in HTML, that by no means cover the full range of usage.  So for text marked as bold or italic without a "strong" or "emphasis" style, it seems to me that <strong> and <em> would not be appropriate.  If you don't like <b> or <i>, you can always use CSS inline, but <b> and <i> are perfectly reasonable things to use within a blog entry, which is after all usually more a personal essay than a presentation of structured content--most blog structure is outside of the individual entry.
  • Anonymous
    June 08, 2006
    It may be my knowledge of English, but in this context, to me 'appropriate' means: what does reflect the intention of an author best.

    And almost all of the time an author wants to emphasize some piece of text and does not (or should not) care how this task is accomplished. That is (or should be) up to a graphic/visual/publication designer. For one publication presenting the text as bold is best; for others choosing a different color can be better. What's more, such choices can change over time (so can the selection of text to emphasize, but that's another discussion).

    If you have a choice: always go for semantics and leave presentation to a later stage of the production of the document. If possible, that should be dealt with at reading time (in this case by applying a CSS). At Sevensteps, we have been very, very strict about this from the beginning, and it has always helped us accomplish the task.
  • Anonymous
    June 08, 2006
    Custom Styles would be a great benefit to me and probably many others. That's one of the things that right now will limit my use of Word in this regard.  Keep up the good work!
  • Anonymous
    June 08, 2006
    Zeyad, I'd like to talk about all kinds of lists - and how they might interact with other styles.

    To get good list support in XHTML, styles would be a good way to go. You can set up styles that work together. E.g., how about a set of numbered styles like:

    ListNum1 ... ListNum5
    ListBull1 ... ListBull5
    Quote1 ....  Quote5
    Preformat1 ... Preformat5
    dt1 ... dt5
    dd1 ... dd5

    This means you can put a Quote2 style after a ListNum1 and the formatter will know that you want to embed the <blockquote> in the <li>. I work on a project where we have built a reasonably complete set of such styles and the formatter to output XHTML. It works well, via a custom nested styles menu. You can see a screenshot in this post of mine:  http://ptsefton.com/blog/2006/06/07/word_lists_without_tears

    Without the styles I doubt you will be able to build good support for nested lists combined with other semantics like quotes or code samples. With styles it is trivial to code.

    If you can't add support for style-based lists as I suggest here then how about adding hooks for a user-defined XSLT stylesheet and/or macros that have access to the underlying Open XML. That way my team can write it for you :-).
  • Anonymous
    June 08, 2006
    Actually, Word (as far back as the venerable 97 edition) already comes with "emphasis" and "strong emphasis" styles (translating of course into normal + italic and normal + bold by default) in its default stylesheet, so Word internally can preserve the semantics. It'd then be just a matter of user interface, hooking up the "bold" and "italics" buttons to the styles (apropos, do styles finally cascade in Word 2007?) rather than creating a new range that subclasses the existing style. You could then define some sort of mechanism to map Word styles into CSS styles and XHTML elements, only resorting to <span> for nameless styles

    While I'm here: can lists nest in Word 2007?
  • Anonymous
    June 08, 2006
    I think it would be better if quotes are mapped to <blockquote><p></p></blockquote>, for cases wherein the blog site is going for XHTML Strict. In Strict, blockquotes must only contain block-level elements.
  • Anonymous
    June 12, 2006
    I have written a little more on this, with more concrete examples of list formatting. http://ptsefton.com/blog/2006/06/13/list-samples


    Opening up the file and transforming it is fine - but that's not as good as having a built-in mechanism for export. Images are one big issue here. Word has always done a lovely job of exporting images to HTML (since word 97 anyway) whereas if you work only at the file format level you won't get web-ready images.

    If you guys don't want to add an XSLT export facility like OpenOffice.org (theirs is broken, doesn't do images at all) then please at least export style names as classes and we can post-process word docs to add correct nested structures.
  • Anonymous
    June 18, 2006
    I like the custom classes. I hope it works for images too! I have classes like leftalign and rightalign to align my images. :D
  • Anonymous
    July 14, 2006
    Hi Brian,
    I have a question regarding run properties defined in styles (sorry this is a bit off topic, not related to xhtml) :
    Let say I have defined two styles, one for a paragraph, the other for a run, as follows :

    <style name="P">
    <rPr><sz val="56"/><b val="on"/></rPr>
    </style>

    <style name="R">
    <rPr><sz val="24"/><b val="off"/></rPr>
    </style>

    now, if I have the following content :
    <p>
    <pPr><pStyle val="P"/></pPr>
    <r>word1</r>
    <r><rStyle val="R"/>word2</r>
    </p>

    then when opening in Word 2007 Beta, "word1" appears with size 26 (ok) and font weight is bold (ok)
    "word2" appears with size 12 (ok) and font weight bold (not ok!!)

    now, if I change the run property <b/> (in "R" style) and switch it "on" -> "word2" font-size is 12 and font-weight is not bold anymore...

    What I don't understand is that defining run properties in a style and applying it to a paragraph seems to make all the runs inherit those properties, except if they override them. It works for <sz> but not for <b> (worse, I have to turn it on!).
    I don't understand this behavior, can you shed some light on this, please?

    thanks.
  • Anonymous
    July 20, 2006
    Hi Ray,

    Tristan has answered your question at:

    http://openxmldeveloper.org/forums/thread/371.aspx

    Zeyad Rajabi (MS)
  • Anonymous
    July 20, 2006
    It's great to see you putting so much thought into semantics as you develop the Word blogging feature, Zeyad. On behalf of the web standardistas of the world, we thank you!

    I think the nested pre and code elements for the "Code" style make perfect sense as long as what is being presented is a block of code. In many cases, however, an author will want to simply present a single keyword, or short code snippet, and for that I think an "Inline Code" style that simply mapped to a code element would be very useful.

    To satisfy those calling for greater control over semantic mappings (such as whether clicking the bold and italic buttons should produce strong and em elements, respectively), you would need to provide a preferences page with a series of mappings that could be switched on and off. Here are some ideas for what such a list might contain:

    - Output bold as strong
    - Output italic as emphasis
    - Output indent as quotation
    - Output strikethrough as deleted text
    - Output underline as inserted text
    - Discard non-semantic formatting (alignment, font, size, color, etc.)

    The ability to define custom style mappings to HTML classes is an excellent feature. I very much hope you can get this working!