RFC 822: Lexical Analysis of Messages
Topic Last Modified: 2006-06-11
A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters. It is separated from the headers by a null line, that is, a line with nothing preceding the carriage return/line feed (CR/LF).
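As a minimal sketch of this separation rule, the following Python function splits a raw message at the first null line. The name split_message is hypothetical, and the sketch assumes the message uses CR/LF line endings throughout and begins with at least one header field.

```python
def split_message(raw: str) -> tuple[str, str]:
    """Split a raw message into (header_block, body) at the first null line."""
    # A null line is a CR/LF that directly follows the CR/LF ending the
    # last header line, so the separator appears as two CR/LF pairs in a row.
    header_block, sep, body = raw.partition("\r\n\r\n")
    if not sep:
        # No null line found: the message has headers only and no body.
        return raw, ""
    # Restore the CR/LF that terminates the last header field.
    return header_block + "\r\n", body
```

The body is returned untouched, because it is treated only as a sequence of lines and is not analyzed further.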
About Header Fields
Each header field can be viewed as a single logical line of ASCII characters. It is composed of a field name, followed by a colon (:), followed by a field body, and terminated by a CR/LF.
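The name/colon/body layout can be sketched in Python as follows. The function name parse_field is hypothetical, and trimming the white space around the body is a convenience of this sketch rather than something the text above requires.

```python
def parse_field(line: str) -> tuple[str, str]:
    """Split one unfolded header field into (field_name, field_body)."""
    # The first colon ends the field name; later colons belong to the body.
    name, sep, body = line.partition(":")
    if not sep or not name:
        raise ValueError("not a header field: %r" % line)
    # Drop the terminating CR/LF, then surrounding SPACE/HTAB (assumption:
    # callers want the body without its leading separator space).
    return name, body.rstrip("\r\n").strip(" \t")
```

For instance, parse_field("Subject: Hello: world\r\n") yields the name "Subject" and the body "Hello: world".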
Wrapping Field Bodies
The field-body portion of a header can be wrapped into a multiple-line representation. The general rule is that wherever there is linear-white-space (not simply LWSP-characters), a CR/LF immediately followed by at least one LWSP-character can instead be inserted.
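A rough Python sketch of folding is shown below. The function name fold and the default width of 72 are assumptions made for illustration. The sketch also simplifies the rule: it folds at any SPACE or HTAB, whereas the rule above permits folding only where linear-white-space is allowed, which excludes, for example, white space inside a quoted-string.

```python
def fold(field: str, width: int = 72) -> str:
    """Fold an unfolded header field (without its final CR/LF) at white space."""
    lines = []
    while len(field) > width:
        # Last SPACE or HTAB within the limit; index 0 is skipped so that a
        # continuation line is never cut at its own leading white space.
        cut = max(field.rfind(" ", 1, width + 1),
                  field.rfind("\t", 1, width + 1))
        if cut <= 0:
            break  # no white space available to fold at
        lines.append(field[:cut])
        # The continuation keeps the white space, so every inserted CR/LF
        # is immediately followed by an LWSP-character.
        field = field[cut:]
    lines.append(field)
    return "\r\n".join(lines)
```

Because each continuation line keeps its leading white space, deleting the inserted CR/LF sequences restores the original single-line field.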
This word-wrap capability is referred to in RFC 822 as "folding," and the process of moving from the multiple-line representation of a header field to its single-line representation is called "unfolding." Unfolding is accomplished by regarding a CR/LF immediately followed by an LWSP-character as equivalent to the LWSP-character alone.
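Unfolding as described here amounts to deleting each CR/LF that is immediately followed by a SPACE or HTAB. A minimal Python sketch follows; the function name unfold is hypothetical.

```python
import re

# A CR/LF counts as a fold only when an LWSP-character (SPACE or HTAB)
# follows it; the lookahead leaves that character in place.
_FOLD = re.compile(r"\r\n(?=[ \t])")

def unfold(field: str) -> str:
    """Return the single-line representation of a folded header field."""
    return _FOLD.sub("", field)
```

The CR/LF that terminates the field is left alone, because no LWSP-character follows it.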
Structure of Header Fields
The name of a header field must be composed of printable ASCII characters (characters with ASCII values between decimal 33 and 126, except colons). The header field body can be composed of any ASCII characters except CR or LF. (While CR and/or LF can be present in the actual text, they are removed when the field is unfolded.)
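The field-name rule translates directly into a character-range check. The sketch below is one way to write it in Python; the function name is_valid_field_name is hypothetical.

```python
def is_valid_field_name(name: str) -> bool:
    """Check that a field name uses only printable ASCII other than colon."""
    # Printable ASCII here means decimal 33 through 126, which excludes
    # SPACE (32), the control characters, and DEL (127).
    return bool(name) and all(
        33 <= ord(ch) <= 126 and ch != ":" for ch in name
    )
```

Under this check, "X-Mailer" is accepted, while "Bad Name" is rejected because SPACE has the value 32.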
Certain header field bodies such as those for dates and addresses have internal structure. Others, such as "Subject" and "Comments", are regarded simply as strings of text, and are defined as <text>.
The Structure of Header Field Bodies
The corresponding section of RFC 822 describes the structure of header field bodies. It assumes that a "lexical analyzer" interprets the bodies of header fields, and that those bodies consist of the following lexical symbols:
- Individual special characters
- Quoted-strings
- Domain-literals
- Comments
- Atoms
In RFC 822, that section illustrates the usage rules for these elements and for the structure of header fields with an example.
Header Field Definitions
This section lists the syntax rules for header field names and header field bodies, for the purpose of detecting fields within messages.
Lexical Tokens
The rules in this section define a "lexical analyzer," which feeds tokens to higher-level parsers. For example:
CHAR = <any ASCII character>
DIGIT = <any ASCII decimal digit>
LWSP-char = SPACE / HTAB
linear-white-space = 1*([CRLF] LWSP-char)
atom = 1*<any CHAR except specials, SPACE and CTLs>
comment = "(" *(ctext / quoted-pair / comment) ")"
word = atom / quoted-string
The Clarifications section gives more detail regarding quoting, white space, comments, and case-sensitivity.
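To make the token rules concrete, here is a small Python sketch of a lexical analyzer for an unfolded, structured field body. The names tokenize, _scan_delimited, and _scan_comment are hypothetical. The set of specials is not listed above and is taken from the specials rule in RFC 822 itself. As a simplification, comments are returned as tokens rather than discarded.

```python
# specials as defined in RFC 822: ( ) < > @ , ; : \ " . [ ]
SPECIALS = set('()<>@,;:\\".[]')

def _scan_delimited(s: str, start: int, close: str) -> int:
    """Return the index just past the closing delimiter of a token."""
    i = start + 1
    while i < len(s):
        if s[i] == "\\":
            i += 2              # quoted-pair: skip the escaped character
        elif s[i] == close:
            return i + 1
        else:
            i += 1
    raise ValueError("unterminated token at index %d" % start)

def _scan_comment(s: str, start: int) -> int:
    """Return the index just past a comment, which may nest."""
    depth, i = 0, start
    while i < len(s):
        if s[i] == "\\":
            i += 2              # quoted-pair inside a comment
            continue
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated comment at index %d" % start)

def tokenize(body: str) -> list[tuple[str, str]]:
    """Return (kind, text) tokens for an unfolded structured field body."""
    tokens = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch in " \t":
            i += 1              # white space only separates tokens
        elif ch == '"':
            j = _scan_delimited(body, i, '"')
            tokens.append(("quoted-string", body[i:j]))
            i = j
        elif ch == "[":
            j = _scan_delimited(body, i, "]")
            tokens.append(("domain-literal", body[i:j]))
            i = j
        elif ch == "(":
            j = _scan_comment(body, i)
            tokens.append(("comment", body[i:j]))
            i = j
        elif ch in SPECIALS:
            tokens.append(("special", ch))
            i += 1
        elif 33 <= ord(ch) <= 126:
            # atom: a run of characters that are not specials, SPACE, or CTLs
            j = i
            while (j < len(body) and 33 <= ord(body[j]) <= 126
                   and body[j] not in SPECIALS):
                j += 1
            tokens.append(("atom", body[i:j]))
            i = j
        else:
            raise ValueError("unexpected character at index %d" % i)
    return tokens
```

Calling tokenize('"Joe Q." <joe@example.com> (work (main))') produces a quoted-string, the special "<", the atoms and specials of the address, the special ">", and a single comment token that contains the nested comment.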