Token Rules

[This content is no longer valid. For the latest information on "M", "Quadrant", SQL Server Modeling Services, and the Repository, see the Model Citizen blog.]

Token rules are the first rules to be processed against a stream of input text. Their function is to break the input stream into tokens by defining the language's keywords and the literals associated with them. These tokens are then processed by interleave and syntax rules.

Token Rule Examples

This section illustrates several different uses of token rules.

Keywords

The grammar of the Microsoft code name “M” modeling language can itself be written in MGrammar, and is included as one of the SDK samples. The following code shows how several common “M” keywords are defined.

        @{Classification["Keyword"]} final token ModuleKeyword = "module";
        @{Classification["Keyword"]} final token TypeKeyword = "type";
        @{Classification["Keyword"]} final token ImportKeyword = "import";
        @{Classification["Keyword"]} final token ExportKeyword = "export";

These rules describe what the language parser does when it finds a character string that matches the right side of one of these rules. The parser's goal is to build a syntax tree that reflects the rules of the language. The first step is to tokenize the input stream by replacing strings that match the right side of token rules with a token, which is a data structure with a name and a value. For example, when the string "type" is found in the input stream, it is replaced with a TypeKeyword token.

Note the use of the Classification attribute, which is set to Keyword. The “M” modeling language services use this to implement syntax coloring in the code name “Intellipad” editor. If your language contains keywords that you want “Intellipad” to syntax color, use this attribute.

Also note the use of the final keyword: in cases of potential ambiguity in the input stream, this keyword ensures that this rule is evaluated last.
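
For example, the following is a minimal sketch of a grammar fragment in which the input "type" matches two token rules (the Identifier rule here is hypothetical and is not part of the “M” grammar sample):

        @{Classification["Keyword"]} final token TypeKeyword = "type";
        token Identifier = ('a'..'z' | 'A'..'Z')+;

Both rules match the input "type"; marking TypeKeyword as final tells the parser how to order the rules when resolving the collision.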

Defining Literals

You can also use token rules to define allowable literal values for identifiers or other literals in the input text that are associated with keywords. In the previous example, the right sides of the rules were character strings that represented keywords. This example is more complex in that the right side of a rule can also be a reference to other rules, which imposes a hierarchical structure on the rules.

The following code defines a rule for IdentifierCharacters by chaining together multiple token rules.

        token Letter = 'a'..'z' | 'A'..'Z';
        token IdentifierBegin = '_' | Letter;
        token DecimalDigit =  '0'..'9';
        token IdentifierCharacter 
            = IdentifierBegin
            | '$'
            | DecimalDigit;
        token IdentifierCharacters = IdentifierCharacter+;

Proceeding in a top-down direction, the top-most rule specifies that the IdentifierCharacters token equals one or more IdentifierCharacter tokens. Note the use of the Kleene operator +. Other Kleene operators such as * and ? can also be used.
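
For instance, hypothetical token rules (not part of the sample above) using each of the Kleene operators might look like the following sketch:

        token Digits = DecimalDigit+;     // one or more digits
        token Sign = ('+' | '-')?;        // zero or one sign character
        token Spaces = ' '*;              // zero or more space characters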

The next rule down tells us that an IdentifierCharacter token can be one of three different things:

  • An IdentifierBegin token.

  • The character "$".

  • A DecimalDigit token.

Note the use of the "or" operator (|).

Each of these alternatives is referred to as a production, and each production can be composed of multiple terms, although in this case each production contains only a single term.
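
As a sketch of a production with multiple terms (the Identifier rule is illustrative and not part of the sample above), the following rule has a single production composed of an IdentifierBegin term followed by a repeated IdentifierCharacter term:

        token Identifier = IdentifierBegin IdentifierCharacter*;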

Going down to the next level of rules, one alternative, DecimalDigit, matches a character in the range '0' through '9'. Note the use of the .. operator to specify a range of values.

Another alternative specifies that an IdentifierBegin token consists of either the character "_" or the token Letter; this latter token matches an uppercase or lowercase alphabetic character.

How Tokenizing is Done

A language parser transforms an input stream into an Abstract Syntax Tree (AST), which consists of nodes, each representing a rule (syntax, interleave, or token) that was successfully applied. The leaf nodes are tokens: nodes that represent strings that are language keywords, or literals associated with a keyword.

The first step a parser performs is to transform the input stream of characters into a stream of tokens. As a first approximation, the parser looks through the input stream, and when it finds a match with a token rule, it replaces the matching characters with a token. If it finds characters that do not match any token rule, it performs some kind of error processing.

Apply this procedure to the fragments of the “M” modeling language represented by the two code samples previously listed. Suppose the language input consists of the single character "t". This matches the token rule for Letter, so it is replaced by the Letter token. This token matches the right side of the IdentifierBegin rule, so it is replaced by that token. Likewise, the IdentifierBegin token matches the IdentifierCharacter rule, and then the IdentifierCharacters rule, so ultimately the character "t" is replaced by the IdentifierCharacters token.
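
Summarized as a chain of matches, the single-character input is promoted through the rule hierarchy as follows:

        "t"  matches  Letter
             matches  IdentifierBegin
             matches  IdentifierCharacter
             matches  IdentifierCharacters   (a sequence of one)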

Now consider what happens if the language input is the character stream "type". If the parser read a character and completely processed it before reading the next one, you would get a sequence of four IdentifierCharacters tokens, one for each character in the "type" character stream, following the procedure outlined in the preceding paragraph. But that is not what happens: after processing the first character and determining that it matches the Letter rule, the parser reads the next character and finds that it too matches the Letter rule. It then finds that the two Letter tokens can be resolved to a single IdentifierCharacters token. It reads the third character, and similarly finds that three Letter tokens can be resolved to an IdentifierCharacters token. Likewise, after reading the fourth character, it finds that all four Letter tokens can be resolved to an IdentifierCharacters token.

But the "type" character stream also matches the TypeKeyword rule. So how does the parser know whether "type" should be tokenized as an IdentifierCharacters token, or as a TypeKeyword token?

Ambiguity and Precedence

A language whose parser can parse some input in more than one way is said to be ambiguous. MGrammar applies a number of internal precedence rules to avoid ambiguity. One such rule says that a rule that involves specific literals (for example, the character string "type") takes precedence over rules that can match inputs of varying lengths, such as the IdentifierCharacters rule. This precedence rule causes the input stream "type" to be tokenized as a TypeKeyword token.
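
As a minimal sketch of this behavior (the grammar and rule names here are hypothetical), consider a grammar that contains both kinds of rules:

module KeywordWins
{
    language KeywordWins
    {
        syntax Main = TypeKeyword | Identifier;
        token TypeKeyword = "type";
        token Identifier = ('a'..'z')+;
    }
}

Given the input "type", the rule with the specific literal, TypeKeyword, wins over the variable-length Identifier rule.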

Another internal precedence rule states that "longer terms win", as shown in the following grammar.

module HelloWorld 
{
    language HelloWorld 
    {
        syntax Main = (tType) 
                    | (tTyp tEz);
        token tType = "type";
        token tTyp = "typ"; 
        token tEz = "e";
    }
}

This grammar can parse the input string "type" in two different ways:

  • As the token tType.

  • As the token tTyp followed by the token tEz.

Because of the "longer terms win" rule, it is parsed as the single token tType. You can verify this with “Intellipad” by loading the preceding grammar and passing it the string "type". You can see how the input was parsed in the output pane.

There are also explicit keywords in MGrammar that enable you to resolve ambiguity. The precedence keyword can be applied to the different productions of a rule to specify the order in which to consider the productions. The left and right keywords can be applied to terms in a production in case it is necessary to resolve associative ambiguity.
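
For example, the following sketch applies the precedence keyword to the two productions of the HelloWorld grammar shown above (the exact annotation syntax is an assumption based on the production-qualifier form described here; check it against the MGrammar reference):

        syntax Main = precedence 1: (tType)
                    | precedence 2: (tTyp tEz);

With these qualifiers, the order in which the two productions are considered is stated explicitly instead of being left to the built-in precedence rules.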