Clarification on 'WordDocument' stream in OLE file format

Parth Gupta 180 Reputation points
2024-06-02T11:43:36.5766667+00:00

Hi,

I am trying to parse a '.doc' file (OLE file) in Python. I am trying to understand the structure of the 'WordDocument' stream inside the file.
With reference to [MS-DOC] and [MS-CFB], it is known that this stream must be present and have the FIB (File Information Block) at zero offset.

I can parse the complete FIB.

However, I am unable to understand the content of the 'WordDocument' stream after the FIB. The stream contains some of the document text, but also a strange pattern of letters (see the attached screenshot of a file opened in a hex editor). Kindly provide some explanation of this.

In the documentation, it is mentioned that this stream has no pre-determined format other than the FIB being present. But then, something must explain what type of data is stored in this stream after the FIB.

Kindly point me towards any documentation that explains, in detail, the contents of the 'WordDocument' stream other than the FIB.

Thanks

[Attached screenshot: the file opened in a hex editor]

Office Open Specifications

Accepted answer
  1. Tom Jebo 2,076 Reputation points Microsoft Employee
    2024-06-03T22:10:15.18+00:00

    Hi @Parth Gupta, apart from the FIB, the WordDocument stream is primarily document content, i.e. the actual text, images, shapes, etc. that are rendered in the document. The other structures are largely in the Table streams. For this reason, we don't specifically list out all the contents of the WordDocument stream. To get a more detailed idea of what is in the WordDocument stream (beyond my generalization above), you should 1) read the entire section 1.3 Overview of [MS-DOC] and 2) follow one or more of the algorithms listed in section 2.4 Document Content. I believe you will then start to understand the purpose of the various streams better.

    Best regards,
    Tom Jebo
    Microsoft Open Specifications Support
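
As a rough illustration of the point above that most non-content structures live in the Table streams, the following Python sketch reads a few fields from the FibBase at offset 0 of the WordDocument stream and reports which Table stream the file uses. The function name `parse_fib_base` and the sample bytes are hypothetical and not part of the thread; the field offsets and bit masks follow the FibBase layout in [MS-DOC] as I understand it, so treat them as assumptions to verify against the specification.

```python
import struct

# wIdent value that [MS-DOC] gives for a Word binary file (assumed here).
WORD_MAGIC = 0xA5EC
FIB_BASE_SIZE = 32  # size of the FibBase structure in bytes


def parse_fib_base(word_document: bytes) -> dict:
    """Decode a few FibBase fields from the start of the WordDocument stream.

    Hypothetical helper for illustration only; it does not parse the full FIB.
    """
    if len(word_document) < FIB_BASE_SIZE:
        raise ValueError("stream is shorter than the 32-byte FibBase")

    # Offset 0: wIdent (2 bytes), offset 2: nFib (2 bytes), little-endian.
    w_ident, n_fib = struct.unpack_from("<HH", word_document, 0)
    if w_ident != WORD_MAGIC:
        raise ValueError(f"unexpected wIdent 0x{w_ident:04X}")

    # Offset 10: 16-bit flags field holding fComplex, fEncrypted, fWhichTblStm.
    (flags,) = struct.unpack_from("<H", word_document, 10)

    return {
        "nFib": n_fib,
        "fComplex": bool(flags & 0x0004),
        "fEncrypted": bool(flags & 0x0100),
        # fWhichTblStm selects the Table stream that the FIB offsets refer to.
        "table_stream": "1Table" if flags & 0x0200 else "0Table",
    }


# Synthetic 32-byte FibBase, built only to exercise the function.
sample = bytearray(FIB_BASE_SIZE)
struct.pack_into("<HH", sample, 0, WORD_MAGIC, 0x00C1)
struct.pack_into("<H", sample, 10, 0x0200)  # set fWhichTblStm only

info = parse_fib_base(bytes(sample))
print(info)
```

With a real file, the stream bytes could be obtained with a compound-file reader such as the third-party `olefile` package (for example `olefile.OleFileIO(path).openstream("WordDocument").read()`); the text-retrieval algorithm in section 2.4 then continues from the Table stream that this sketch identifies.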
