Token Rules
Token rules are the first rules to be processed against a stream of input text. Their function is to break the input stream into tokens by defining the language's keywords and the literals associated with them. These tokens are then processed by interleave and syntax rules.
Token Rule Examples
This section illustrates several different uses of token rules.
Keywords
The Microsoft code name “M” modeling language itself can be written in MGrammar, and is included as one of the SDK samples. The following code shows how several common “M” keywords are defined.
@{Classification["Keyword"]} final token ModuleKeyword = "module";
@{Classification["Keyword"]} final token TypeKeyword = "type";
@{Classification["Keyword"]} final token ImportKeyword = "import";
@{Classification["Keyword"]} final token ExportKeyword = "export";
These rules describe what the language parser does when it finds a character string that matches the right side of one of these rules. The parser's goal is to build a syntax tree that reflects the rules of the language. The first step is to tokenize the input stream by replacing strings that match the right side of token rules with a token, which is a data structure with a name and a value. For example, when the string "type" is found in the input stream, it is replaced with a TypeKeyword token.
Note the use of the Classification attribute, which is set to Keyword. The “M” modeling language services use this to implement syntax coloring in the code name “Intellipad” tool editor. If your language contains keywords that you want to be syntax colored by “Intellipad”, use this attribute.
Also note the use of the final keyword: in cases of potential ambiguity in the input stream, this keyword ensures that this rule is evaluated last.
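As a sketch, a hypothetical language would declare its own keywords following the same pattern; the WhileKeyword and ReturnKeyword rules below are illustrative and are not part of the “M” grammar:
@{Classification["Keyword"]} final token WhileKeyword = "while";
@{Classification["Keyword"]} final token ReturnKeyword = "return";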
Defining Literals
You can also use token rules to define allowable literal values for identifiers or other literals in the input text that are associated with keywords. In the previous example, the right sides of the rules were character strings that represented keywords. This example is more complex in that the right sides of rules can also be references to other rules, which imposes a hierarchical structure on the rules.
The following code defines a rule for IdentifierCharacters by chaining together multiple token rules.
token Letter = 'a'..'z' | 'A'..'Z';
token IdentifierBegin = '_' | Letter;
token DecimalDigit = '0'..'9';
token IdentifierCharacter
    = IdentifierBegin
    | '$'
    | DecimalDigit;
token IdentifierCharacters = IdentifierCharacter+;
Proceeding in a top-down direction, the top-most rule specifies that the IdentifierCharacters token equals one or more IdentifierCharacter tokens. Note the use of the Kleene operator +. Other Kleene operators such as * and ? can also be used.
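The following rules are a hypothetical sketch of the other Kleene operators, reusing the DecimalDigit rule from the sample above; they are not part of the “M” grammar:
token Digits = DecimalDigit+;                // + : one or more digits
token SignedNumber = ('+' | '-')? Digits;    // ? : zero or one sign
token PaddedNumber = ' '* SignedNumber;      // * : zero or more leading spaces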
The next rule down tells us that the IdentifierCharacter token can be one of three different things:
An IdentifierBegin token.
The character "$".
A DecimalDigit token.
Note the use of the "or" operator (|).
Each of these alternatives is referred to as a production, and each production can be composed of multiple terms, although in this case each production contains only a single term.
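As a sketch of a production with multiple terms, the following hypothetical rule, which again reuses DecimalDigit from the preceding sample, has a second production composed of two terms, a sign followed by digits:
token SignedDecimal
    = DecimalDigit+                  // production 1: a single term
    | ('+' | '-') DecimalDigit+;     // production 2: two terms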
Going down to the next level of rules, one alternative is that a DecimalDigit token is a digit character in the range "0" through "9". Note the use of the .. operator to specify a range of values.
Another alternative specifies that an IdentifierBegin token consists of either the character "_" or the token Letter, and this latter token consists of an uppercase or lowercase alphabetic character.
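Ranges and alternatives can also be combined within a single token rule. The following HexDigit rule is a hypothetical illustration, not part of the “M” grammar:
token HexDigit = '0'..'9' | 'a'..'f' | 'A'..'F';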
How Tokenizing is Done
A language parser transforms an input stream into an Abstract Syntax Tree (AST), which consists of nodes, each of which represents a rule (syntax, interleave, or token) that was successfully applied. The leaf nodes are tokens: nodes that represent strings that are language keywords or literals associated with a keyword.
The first step a parser performs is to transform the input stream of characters into a stream of tokens. As a first approximation, the parser looks through the input stream, and when it finds a match with a token rule, it replaces the matching characters with a token. If it finds characters that do not match a token, then it performs some kind of error processing.
Apply this procedure to the fragments of the “M” modeling language represented by the two code samples previously listed. Suppose the language input consists of the single character "t". This matches the token rule for Letter, so it is replaced by the Letter token. This token matches the right side of the IdentifierBegin rule, so it is replaced by that token. Likewise, the IdentifierBegin token matches the IdentifierCharacter rule, and then the IdentifierCharacters rule, so ultimately the character "t" is replaced by the IdentifierCharacters token.
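If you want to observe this in “Intellipad”, the identifier rules can be wrapped in a complete grammar. The module name, language name, and Main syntax rule in this sketch are assumptions added for illustration:
module IdentifierExample
{
    language IdentifierExample
    {
        syntax Main = IdentifierCharacters;
        token Letter = 'a'..'z' | 'A'..'Z';
        token IdentifierBegin = '_' | Letter;
        token DecimalDigit = '0'..'9';
        token IdentifierCharacter = IdentifierBegin | '$' | DecimalDigit;
        token IdentifierCharacters = IdentifierCharacter+;
    }
}
Passing this grammar the input "t" shows the resulting parse in the output pane.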
Now consider what happens if the language input is the character stream "type". If the parser read each character and completely processed it before reading the next one, you would get a sequence of four IdentifierCharacters tokens, one for each character in the "type" character stream, following the procedure outlined in the preceding paragraph. But that is not what happens: when the parser processes the first character and determines that it matches the Letter rule, it then reads the next character and finds that it too matches the Letter rule. It then finds that two Letter tokens can be resolved to an IdentifierCharacters token. Now it reads the third character, and similarly finds that three Letter tokens can be resolved to an IdentifierCharacters token. Similarly, after reading the fourth character, it finds that four Letter tokens can be resolved to an IdentifierCharacters token.
But the "type" character stream also matches the TypeKeyword
rule. So how does the parser know whether "type" should be tokenized as an IdentifierCharacters
token, or as a TypeKeyword
token?
Ambiguity and Precedence
A language whose parser can parse some input in more than one way is said to be ambiguous. MGrammar applies a number of internal precedence rules to avoid ambiguity. One such rule says that a rule that involves specific literals (for example, the character string "type") takes precedence over rules that can match inputs of varying lengths, such as the IdentifierCharacters rule. This precedence rule causes the input stream "type" to be tokenized as a TypeKeyword token.
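The following grammar is a minimal sketch of this precedence rule in action; the module and rule names are hypothetical. Loading it in “Intellipad” and passing it the input "type" should show the input parsed as a TypeKeyword token rather than as an Identifier token:
module KeywordVersusIdentifier
{
    language KeywordVersusIdentifier
    {
        syntax Main = TypeKeyword | Identifier;
        token TypeKeyword = "type";                        // specific literal
        token Identifier = ('a'..'z' | 'A'..'Z' | '_')+;   // variable length
    }
}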
Another internal precedence rule states that "longer terms win", as shown in the following grammar.
module HelloWorld
{
    language HelloWorld
    {
        syntax Main = (tType)
                    | (tTyp tEz);
        token tType = "type";
        token tTyp = "typ";
        token tEz = "e";
    }
}
This grammar can parse the input string "type" in two different ways:
As the token tType.
As the token tTyp followed by the token tEz.
Because of the "longer terms win" rule, it is parsed as the single token tType. You can verify this with “Intellipad” by loading the preceding code and passing it the string "type". The output pane shows how the input was parsed.
There are also explicit keywords in MGrammar that enable you to resolve ambiguity. The precedence keyword can be applied to the different productions of a rule to specify the order in which to consider the productions. The left and right keywords can be applied to terms in a production when it is necessary to resolve associative ambiguity.