Syntax Rules

09/22/2010

[This content is no longer valid. For the latest information on "M", "Quadrant", SQL Server Modeling Services, and the Repository, see the Model Citizen blog.]

Syntax rules are processed after token rules and interleave rules. The input to the syntax rules is therefore a stream of tokens with no whitespace characters interspersed. Together, the syntax, token, and interleave rules completely define a domain-specific language (DSL).

The format for a syntax rule is the keyword syntax, followed by a rule name, followed by an equal sign (=), followed by varying components on the right side of the equal sign. The right side of a syntax statement may contain a large number of components, many of which are optional, so syntax statements may be fairly complex.

Examples of Syntax Statements

In this section, a grammar is used that defines the Microsoft code name “M” modeling language itself as an example of a DSL grammar. Note that the grammar in this example is highly simplified: “M” is much more complex than this grammar indicates.

Main

Syntax rules form a hierarchy that is built by referencing other rules on the right side. At the top of the hierarchy is a rule named Main, which is required in every MGrammar DSL. The following code shows a Main rule followed by two other rules. Each rule references another rule, so together they form a simple hierarchy.

        syntax Main             = CompilationUnit;
        syntax CompilationUnit  = Module*; 
        syntax Module           = ModuleDeclaration;

The Main rule points to the CompilationUnit rule, which suggests that this is the grammar for a single “M” file, or "compilation unit". The CompilationUnit rule matches zero or more Module nodes, which corresponds to the fact that in “M” a compilation unit is made up of 0-to-many modules. The Module rule in turn matches a ModuleDeclaration node.

Pattern Rules

The ModuleDeclaration rule is more complex. The right side of a rule is made up of one or more productions, which are separated by the "or" symbol (|). In this case there is only a single production, but it contains multiple terms. In this example, the terms consist of several rule references combined with literals.

        syntax ModuleDeclaration 
            = "module" DottedIdentifiers "{" ExportDirective* ModuleMemberDeclaration* "}"  ";"?
            ;

This rule says that a ModuleDeclaration is made up of the literal "module", which happens to be an “M” keyword, a DottedIdentifiers node, a "{" literal, 0-to-many ExportDirective nodes, 0-to-many ModuleMemberDeclaration nodes, a "}" literal, and a ";" literal. The ";" literal is followed by the "?" operator, which indicates that the semicolon following the braces is optional. This is exactly what a module looks like in the “M” language, except that this example omits a number of variations.

Note that MGrammar considers any literals that occur inside syntax rules to be anonymous tokens. Even if they are not formally declared in a token rule, the MGrammar parser treats them as tokens.

Rules with Alternative Productions

The ModuleMemberDeclaration rule contains several productions, which correspond to the four main types of “M” statements:

ExtentDeclaration
ComputedValueDeclaration
TypeDeclaration
TopScopeGraphInitializationExpression, which represents an MGraph value.

        syntax ModuleMemberDeclaration
            = ExtentDeclaration
            | ComputedValueDeclaration
            | TypeDeclaration
            | TopScopeGraphInitializationExpression
            ;

Each of the productions in this rule refers to another rule.

Optional Syntax Rule Components

There are numerous additional optional components that syntax rules may contain. These components are discussed in detail in other topics.

At the rule level, attributes can be applied and the rule can be parameterized.

At the production level, a precedence declaration can be applied to resolve ambiguity. You can also specify a constructor to shape and customize the output syntax tree that a successful parse of an input stream creates, if the default syntax tree does not meet your requirements.

At the term level, you can bind abbreviated variable names to the term for reference in the output constructor. You can specify numerous attributes. You can apply a precedence declaration to resolve ambiguity. The term expression itself can take any of a variety of forms: a rule reference, a text literal (which generates an anonymous token), a character range expression, and various other expressions.

How Syntax Rules are Processed

When an MGrammar program is compiled, a parser is created. When a text stream is fed into the parser, the text stream is initially converted into a stream of tokens and whitespace is removed from the stream.

The parser next reads the token stream one token at a time and attempts to arrive at a valid abstract syntax tree. The following input, an empty module in “M”, illustrates how the parser processes a token stream.

module Southwind
{
}

This input stream gets tokenized into the following tokens:

The keyword module.
The IdentifierName token Southwind.
The anonymous token "{".
The anonymous token "}".
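
The tokenization step can be approximated outside of MGrammar. The following Python sketch is only an illustration, not the MGrammar implementation: the TOKEN_SPEC table, the token kind names Whitespace, Keyword, and Literal, and the tokenize function are all assumptions made for this example. It shows whitespace being discarded (the job of the interleave rules) and literals from syntax rules surfacing as tokens.

```python
import re

# Hypothetical token table for this example only; the real "M" grammar
# defines many more tokens than these.
TOKEN_SPEC = [
    ("Whitespace", r"\s+"),                          # stands in for an interleave rule
    ("Keyword", r"module(?![_A-Za-z0-9$])"),         # "module" not followed by an identifier character
    ("IdentifierName", r"[_A-Za-z][_A-Za-z0-9$]*"),
    ("Literal", r"[{};.]"),                          # anonymous tokens from syntax-rule literals
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, text) pairs with whitespace removed."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "Whitespace":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("module Southwind\n{\n}"))
```

For the empty module above, the sketch yields four tokens: the keyword, the identifier, and the two brace literals.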

The first token the parser encounters is the keyword module, which matches the first term on the right side of the ModuleDeclaration syntax rule.

The next token is the IdentifierName. This was derived during tokenization by the following rules.

token IdentifierName = IdentifierBegin IdentifierCharacter*;
token IdentifierBegin = '_' | Letter;
token IdentifierCharacter  = IdentifierBegin
                           | '$'
                           | DecimalDigit;
token Letter = 'a'..'z' | 'A'..'Z';
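
To see how these token rules compose, each one can be translated into a regular expression fragment. This is a rough Python sketch rather than anything MGrammar generates. The DecimalDigit rule is not shown in this topic, so the sketch assumes it means '0'..'9', and the helper name is_identifier_name is made up for the example.

```python
import re

# Each fragment mirrors one token rule from the grammar above.
LETTER = r"[a-zA-Z]"                                  # token Letter
DECIMAL_DIGIT = r"[0-9]"                              # assumption: DecimalDigit = '0'..'9'
IDENTIFIER_BEGIN = rf"(?:_|{LETTER})"                 # token IdentifierBegin
IDENTIFIER_CHARACTER = rf"(?:{IDENTIFIER_BEGIN}|\$|{DECIMAL_DIGIT})"  # token IdentifierCharacter
IDENTIFIER_NAME = re.compile(rf"{IDENTIFIER_BEGIN}{IDENTIFIER_CHARACTER}*")  # token IdentifierName


def is_identifier_name(text):
    """Return True if the whole string matches the IdentifierName rule."""
    return IDENTIFIER_NAME.fullmatch(text) is not None


for candidate in ["Southwind", "_tmp$1", "1abc", "$x"]:
    print(candidate, is_identifier_name(candidate))
```

Under that assumption, Southwind and _tmp$1 are accepted, while 1abc and $x are rejected because neither a digit nor '$' may begin an identifier.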

An IdentifierName token on its own does not match what the ModuleDeclaration syntax rule requires, which is a DottedIdentifiers node. However, the parser can use the following rules to show that an IdentifierName token is in fact a DottedIdentifiers node.

syntax DottedIdentifiers = ParsedIdentifier
                         | DottedIdentifiers "." ParsedIdentifier;
syntax ParsedIdentifier = IdentifierVerbatim
                        | IdentifierName;
token IdentifierVerbatim = '@[' IdentifierVerbatimCharacters ']';

The parser can now verify that the two remaining brace tokens match the rule, because everything between the braces is optional: ExportDirective* and ModuleMemberDeclaration* each match zero or more nodes. In other words, an empty module declaration is a valid statement in “M”.
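
The matching steps in this walkthrough can be sketched as a small hand-written recognizer in Python. This is an assumption-laden illustration: the MGrammar parser is generated from the grammar, not written this way. The (kind, text) token shape and the function names are hypothetical, and the sketch accepts only modules with an empty body. The left-recursive DottedIdentifiers rule is rewritten as a loop.

```python
IDENTIFIER_KINDS = ("IdentifierName", "IdentifierVerbatim")  # ParsedIdentifier alternatives


def expect(tokens, pos, token):
    """Advance past `token` if it is next; otherwise return None."""
    if pos is not None and pos < len(tokens) and tokens[pos] == token:
        return pos + 1
    return None


def parse_dotted_identifiers(tokens, pos):
    """DottedIdentifiers = ParsedIdentifier | DottedIdentifiers "." ParsedIdentifier."""
    if pos is None or pos >= len(tokens) or tokens[pos][0] not in IDENTIFIER_KINDS:
        return None
    pos += 1
    # Consume any further "." ParsedIdentifier pairs.
    while (pos + 1 < len(tokens)
           and tokens[pos] == ("Literal", ".")
           and tokens[pos + 1][0] in IDENTIFIER_KINDS):
        pos += 2
    return pos


def is_empty_module_declaration(tokens):
    """Recognize: "module" DottedIdentifiers "{" "}" ";"? with nothing inside the braces."""
    pos = expect(tokens, 0, ("Keyword", "module"))
    pos = parse_dotted_identifiers(tokens, pos)
    pos = expect(tokens, pos, ("Literal", "{"))
    # ExportDirective* and ModuleMemberDeclaration* both match zero nodes here.
    pos = expect(tokens, pos, ("Literal", "}"))
    if pos is None:
        return False
    # ";"? consumes the semicolon only when it is present.
    if pos < len(tokens) and tokens[pos] == ("Literal", ";"):
        pos += 1
    return pos == len(tokens)


southwind = [("Keyword", "module"), ("IdentifierName", "Southwind"),
             ("Literal", "{"), ("Literal", "}")]
print(is_empty_module_declaration(southwind))
```

A full recognizer would also descend into ExportDirective and ModuleMemberDeclaration between the braces; the sketch stops at the empty case because that is the one this topic walks through.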

If you consider a more complex example where there are actual “M” language statements inside the module, the parser follows a similar procedure in evaluating each statement.
