The Grammar Language

The grammar language is the corner stone of Xtext and is defined in itself – of course.

It is a DSL carefully designed for description of textual languages, based on LL(*)-Parsing that is like Antlr3's parsing strategy and supported by packrat parsers. The main idea is to describe the concrete syntax and how an EMF-based in-memory model is created during parsing.

First an example

To get an idea of how it works we’ll start by implementing a simple example introduced by Martin Fowler. It’s mainly about describing state machines used as the (un)lock mechanism of secret compartments. People who have secret compartments control their access in a very old-school way, e.g. by opening the door first and turning on the light afterwards.

Then the secret compartment, for instance a panel, opens up.

One of those state machines could look like this:


 events
  doorClosed  D1CL
  drawOpened  D2OP
  lightOn     L1ON
  doorOpened  D1OP
  panelClosed PNCL
 end
 
 resetEvents
  doorOpened
 end
 
 commands
  unlockPanel PNUL
  lockPanel   PNLK
  lockDoor    D1LK
  unlockDoor  D1UL
 end
 
 state idle
  actions {unlockDoor lockPanel}
  doorClosed => active
 end
 
 state active
  drawOpened => waitingForLight
  lightOn    => waitingForDraw
 end
 
 state waitingForLight
  lightOn => unlockedPanel
 end
 
 state waitingForDraw
  drawOpened => unlockedPanel
 end
 
 state unlockedPanel
  actions {unlockPanel lockDoor}
  panelClosed => idle
 end

So, we have a bunch of declared events, commands and states. Within states there are references to declared actions, which should be executed when entering such a state. Also there are transitions consisting of a reference to an event and a state. Please read Martin's description if it is not clear enough.

In order to get a complete IDE for this little language from Xtext, you need to write the following grammar:

grammar my.pack.SecretCompartments 
   with org.eclipse.xtext.common.Terminals
 generate secretcompartment "http://www.eclipse.org/secretcompartment"
 
 Statemachine :
  'events'
     (events+=Event)+
  'end'
  ('resetEvents'
     (resetEvents+=[Event])+
  'end')?
  'commands'
     (commands+=Command)+
  'end'
  (states+=State)+;
 
 Event :
  name=ID code=ID;
 
 Command :
  name=ID code=ID;
 
 State :
  'state' name=ID
     ('actions' '{' (actions+=[Command])+ '}')?
     (transitions+=Transition)*
  'end';
 
 Transition :
  event=[Event] '=>' state=[State];
  

In the following the different concepts of the grammar language are explained. We refer to this grammar when appropriate.

Language Declaration

The first line

grammar my.pack.SecretCompartments with org.eclipse.xtext.common.Terminals

declares the name of the grammar. Xtext leverages Java’s classpath mechanism. This means that the name can be any valid Java qualifier. The file name needs to correspond and have the file extension '‘xtext’'. This means that the name needs to be SecretCompartments.xtext and must be placed in package my.pack somewhere on your project’s class path.

The first line is also used to declare any used language (for mechanism details cf. Grammar Mixins).

EPackage declarations

Xtext parsers instantiate Ecore models (aka meta model). An Ecore model basically consists of an EPackage containing EClasses, EDatatypes and EEnums. Xtext can infer Ecore models from a grammar (see Meta-Model Inference) but it is also possible to instantiate existing Ecore models. You can even mix this, use multiple existing Ecore models and infer some others from one grammar.

EPackage generation

The easiest way to get started is to let Xtext infer the meta model from your grammar. This is what is done in the secret compartment example. To do so just state:

generate secretcompartment http://www.eclipse.org/secretcompartment

Which means: generate an EPackage with name secretcompartment and nsURI http://www.eclipse.org/secretcompartment (these are the properties needed to create an EPackage). The whole algorithm used to derive complete Ecore models from Xtext grammars is described in section Meta-Model Inference.

EPackage import

If you already have an existing EPackage, you can import it using either a namespace URI or a resource URI (URIs are an EMF concept).

Using namespace URIs to import existing EPackages

You can import existing EPackages using the following syntax:

import "http://www.eclipse.org/secretcompartment"

Note that if you use a namespace URI, the corresponding EPackage needs to be installed into the workbench, so that the editor can find it. At runtime (i.e. when starting the generator) you need to make sure that the corresponding EPackage is registered in EPackage.Registry.INSTANCE. If you use MWE to drive your code generator, you need to add the following lines to your workflow file:

<bean class="org.eclipse.emf.mwe.utils.StandaloneSetup" 
  platformUri="${runtimeProject}/..">
  <registerGeneratedEPackage value="foo.bar.MyPackage"/>
</bean>

using namespace URIs is typically only interesting when using common Ecore models, such as Ecore itself or the UML metamodel. If you’re developing the EPackage together with the DSL but don’t want to have it derived from the grammar for some reason, we suggest to use a resource URI.

Using resource URIs to import existing EPackages

If the EPackage you want to use is somewhere in your workspace you should refer to it using a platform:/resource/ URI. Platform URIs are a special EMF construct, which allow for some kind of transparency between workspace projects and installed bundles. Consult the EMF documentation (we recomend the book) for details.

An import statement referring to an Ecore file by a platform:/resource/ URI looks like so:

import "platform:/resource/my.project/modelFolder/MyEcore.ecore"

If you want to mix generated and imported Ecore models you’ll have to configure the generator fragment in your MWE file responsible for generating the Ecore classes ( EcoreGeneratorFragment) with resource URIs to the genmodels for the referenced Ecore models. Example:

<fragment class="org.eclipse.xtext.generator.ecore.EcoreGeneratorFragment" genModels="platform:/resource/my.project/model/myEcore.genmodel"/>

Using multiple packages / meta model aliases

If you want to use multiple EPackages you need to specify aliases like so:


  generate secretcompartment http://www.eclipse.org/secretcompartment
  import http://www.eclipse.org/anotherPackage as another

When referring to a type somewhere in the grammar you need to qualify the reference using that alias (example '‘another::CoolType’'). We’ll see later where such type references occur.

It is also supported to put multiple EPackage imports into one alias. This is no problem as long as there are no two EClassifiers with the same name. In such cases none of them are referable. It is even possible to have multiple '‘import’‘s and one ’‘generate’' declared for the same alias. If you do so, for a reference to an EClassifier first the imported EPackages are scanned before it is assumed that a type needs to by generated into the to-be-generated package.

Example:


  generate toBeGenerated http://www.eclipse.org/toBeGenerated
  import http://www.eclipse.org/packContainingClassA
  import http://www.eclipse.org/packContainingClassB

With the declaration from above

  1. a reference to type ClassA would be linked to the EClass contained in http://www.eclipse.org/packContainingClassA,

  2. a reference to type ClassB would be linked to the EClass contained in http://www.eclipse.org/packContainingClassB,

  3. a reference to type NotYetDefined would be linked to a newly created EClass in http://www.eclipse.org/toBeGenerated

Note, that using this feature is not recommended, because it might cause problems, which are hard to track down. For instance, a reference to '' classA'' would as well be linked to a newly created EClass, because the corresponding type in http://www.eclipse.org/packContainingClassA is spelled with a capital letter.

Rules

The default parsing is based on a homegrown packrat parser. It can be substituted by an Anltr parser through the Xtext service mechanism. Antlr is a sophisticated parser generator framework based on an LL(*) parsing algorithm, that works quite well for Xtext. At the moment it is advised to download the plugin de.itemis.xtext.antlr (from update site http://www.itemis.com/xtext/updatesite/milestones) and use the Antlr Parser instead of the packrat parser (cf. Xtext Workspace Setup).

Basically parsing can be separated in the following phases.

  1. lexing

  2. parsing

  3. linking

  4. validation

Terminal Rules

In the first phase, i.e. lexing, a sequence of characters (the text input) is transformed into a sequence of so called tokens. Each token consists of one or more characters and was matched by a particular terminal rule and represents an atomic symbol. In the secret compartments example there are no explicitly defined terminal rules, since it only uses the ID rule which is inherited from the grammar org.eclipse.xtext.common.Terminals (cf. Grammar Mixins). Terminal rules are also referred to as token rules or lexer rules. There is an informal naming convention that terminal-rule names are all uppercase.

Therein the ID rule is defined as follows:


 terminal ID : 
    ('^')?('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*; 
    

It says that a Token ID starts with an optional ‘^’ character, which is called caret, followed by a letter (‘a’..‘z’|‘A’..‘Z’) or underscore (‘_’) followed by any number of letters, underscores and numbers (‘0’..‘9’).

The caret is used to escape an identifier for cases where there are conflicts with keywords. It is removed by the ID rule’s ValueConverter.

This is the formal definition of terminal rules:


 TerminalRule :
   'terminal' name=ID ('returns' type=TypeRef)? ':' 
      alternatives=TerminalAlternatives ';'
 ;

Note, that the order of terminal rules is crucial for your grammar, as they may hide each other. This is especially important for newly introduced rules in connection with mixed rules from used grammars.

If you for instance want to add a rule to allow fully qualified names in addition to simple IDs, you should implement it as a datatype rule, instead of adding another terminal rule.

Return types

A terminal rule returns a value, which is a string (type ecore::EString) by default. However, if you want to have a different type you can specify it. For instance, the rule ‘INT’ is defined as:


 terminal INT returns ecore::EInt : 
   ('0'..'9')+;

This means that the terminal rule INT returns instances of ecore::EInt. It is possible to define any kind of data type here, which just needs to be an instance of ecore::EDataType. In order to tell the parser how to convert the parsed string to a value of the declared data type, you need to provide your own implementation of ‘IValueConverterService’ (cf. value converters).

The implementation needs to be registered as a service (cf. Service Framework).

This is also the point where you can remove things like quotes from string literals or the caret (‘^’) from identifiers.

Extended Backus-Naur form expressions

Token rules are described using “Extended Backus-Naur Form”-like (EBNF) expressions. The different expressions are described in the following. The one thing all of these expressions have in common is the quantifier operator. There are four different quantities

  1. exactly one (the default no operator)

  2. one or none (operator “?”)

  3. any (operator “*”)

  4. one or more (operator “+”)

Keywords / Characters

Keywords are a kind of token rule literals. The ID rule in org.eclipse.xtext.common.Terminals for instance starts with a keyword :


 terminal ID : '^'? .... ;

The question mark sets the cardinality to “none or one” (i.e. optional) like explained above.

Note that a keyword can have any length and contain arbitrary characters.

Character Ranges

A character range can be declared using the ‘..’ operator.

Example:

terminal INT returns ecore::EInt: ('0'..'9')+ ;

In this case an INT is comprised of one or more (note the ‘+’ operator) characters between (and including) ‘0’ and ‘9’.

Wildcard

If you want to allow any character you can simple write a dot: Example:

FOO : 'f' . 'o';

The rule above would allow expressions like ‘foo’, ‘f0o’ or even ‘f\no’.

Until Token

With the until token it is possible to state that everything should be consumed until a certain token occurs. The multi line comment is implemented using it:

terminal ML_COMMENT : '/*' -> '*/' ;

This is the rule for Java-style comments that begin with ‘/*’ and end with ‘*/’.

Negated Token

All the tokens explained above can be inverted using a preceding explanation mark:

terminal ML_COMMENT : '/*' (!'*/')+ ;

Rule Calls

Rules can refer to other rules. This is done by writing the name of the rule to be called. We refer to this as rule calls. Rule calls in terminal rules can only point to terminal rules.

Example:

terminal QUALIFIED_NAME : ID ('.' ID)*;

Alternatives

Using alternatives one can state multiple different alternatives. For instance, the whitespace rule uses alternatives like so:

terminal WS : (' '|'\t'|'\r'|'\n')+ ;

That is a WS can be made of one or more whitespace characters (including ‘ ’,‘\t’,‘\r’,‘\n’)

Groups

Finally, if you put tokens one after another, the whole sequence is referred to as a group. Example:

terminal FOO : '0x' ('0'..'7') ('0'..'9'|'A'..'F') ;

That is the 4-digit hexadecimal code of ascii characters.

Parser Rules

The parser reads in a sequence of terminals and walks through the parser rules. That’s why a parser rule – contrary to a terminal rule – does not produce a single terminal token but a tree of non-terminal and terminal tokens that lead to a so called parse tree (in Xtext it is a.k.a node model). Furthermore, parser rules are handled a as kind of a building plan for how to create EObjects that form the semantic model (the linked! AST). The different constructs like actions and assignments are used to derive types and initialize the semantic objects accordingly.

Extended Backus-Naur Form expressions

In parser rules (as well as in datatype rules) not all the expressions available for terminal rules can be used. Character ranges, wildcards, the until token and the negation are currently only available for terminal rules. Available in parser rules as well as in terminal rules are

  1. groups,

  2. alternatives,

  3. keywords and

  4. “rule calls”#RuleCalls.

In addition to these elements, there are some expressions used to direct how the AST is constructed, which are listed and explained in the following.

Assignments

Assignments are used to assign parsed information to a feature of the current EClass. The current EClass is specified by the return type resp. if not explicitly stated it is implied that the type’s name equals the rule’s name.

Example:


 State :
  'state' name=ID
     ('actions' '{' (actions+=[Command])+ '}')?
     (transitions+=Transition)*
  'end';

The syntactic declaration for states in the state machine example starts with a keyword ‘state’ followed by an assignment :

name=ID

Where the left hand side refers to a feature of the current EClass (in this case EClass ‘State’). The right hand side can be a rule call, a keyword, a cross reference (explained later) or even an alternative comprised by the former. The type of the feature needs to be compatible with the type of the expression on the right. As ID returns an EString the feature name needs to be of type EString as well.

Assignment Operators

There are three different assignment operators, each with different semantics

  1. the simple equal sign “=” is the straight forward assignment, and used for features which take only one element

  2. the “+=” sign (the add operator) expects a multi valued feature and adds the value on the right hand to that feature, which is, of course, a list feature

  3. the “?=” sign (boolean add operator) expects a feature of type EBoolean and sets it to true if the right hand side was consumed (no matter with what values)

Cross References

A unique feature of Xtext is the ability to declare cross links in the grammar. In traditional compiler construction the cross links are not established during parsing but in a later linking phase. This is the same in Xtext, but we allow to specify cross link information in the grammar, which is used during the linking phase. The syntax for cross links is:


 CrossReference :
   '[' type=TypeRef ('|' ^terminal=CrossReferenceableTerminal )? ']'
 ;

For example, the transition is made up of two cross references, pointing to an event and a state:


 Transition :
  event=[Event] '=>' state=[State];

It is important to understand that the text between the square brackets does not refer to another rule, but to the type! This is sometimes confusing, because one usually uses the same name for the rules and the types. That is if we had named the type for events differently like in the following the cross references needs to be adapted as well:


 Transition :
  event=[MyEvent] '=>' state=[State];
 
 Event returns MyEvent : ....;

Looking at the syntax definition of cross references, there is an optional part starting with a vertical bar followed by ‘CrossReferenceableTerminal’. This is the part describing the concrete text, from which the crosslink later should be established. By default (that’s why it’s optional) it is “|ID”.

You may even use alternatives as the referencable terminal. This way, either an ID or a STRING may be used as the referencable terminal, as it is possible in many SQL dialects.

TableRef: table=[Table|(ID|STRING)];

Have a look at the linking section in order to understand how linking is done.

Simple Actions

By default the object to be returned by a parser rule is created lazily on the first assignment. Then the type of the EObject to be created is determined from the specified return type (or the rule name if not explicit return type is specified). With Actions however, creation of EObject can be made explicit. Xtext supports two kinds of Actions:

  1. simple actions

  2. assigned actions.

If at some point you want to enforce creation of a specific type you can use alternatives or simple actions. In the following example TypeB must be a subtype of TypeA. An expression A ident should create an instance of TypeA, whereas B ident should instantiate TypeB.

Example with alternatives:


 MyRule returns TypeA :
   "A" name=ID |
   MyOtherRule; 
 
 MyOtherRule returns TypeB :
   "B" name = ID;

Example with simple actions:


 MyRule returns TypeA :
   "A" name=ID |
   "B" {TypeB} name=ID; 

Generally speaking, the instance is created as soon as the parser hits the first assignment. However, Actions allow to explicitly instantiate any EClass. The notation {TypeB} will create an instance of TypeB and assign it to the result of the parser rule. This allows parser rules without any assignment and object creation without the need to introduce stub-rules.

Unassigned rule calls

We previously explained, that the EObject to be returned is created lazily when the first assignment occurs or when a simple action is evaluated. There is another way one can set the EObject to be returned, which we call an “unassigned rule call”.

Unassigned rule calls (the name suggests it) are rule calls to other parser rules, which are not used within an assignment. If there is no feature the returned value shall be assigned to, the value is assigned to the “to-be-returned” reference.

With unassigned rule calls one can, for instance, create rules which just dispatch between several other rules:


 AbstractToken :
    TokenA |
    TokenB |
    TokenC;

As AbstractToken could possibly return an instance of TokenA, TokenB or TokenC its type must by a super type to these types.

It is now for instance as well possible to further change the state of the AST element by assigning additional things.

Example:


 AbstractToken :
   (TokenA |
    TokenB |
    TokenC ) (cardinality=('?'|'+'|'*'))?;

Thus, to state the cardinality is optional (last question mark) and can be represented by a question mark, a plus, or an asterisk.

Tree Rewrite Actions

LL parsing has some significant advantages over LR algorithms. The most important ones for Xtext are, that the generated code is much simpler to understand and debug and that it is easier to recover from errors and especially Antlr has a very nice generic error recovery mechanism. This allows to have AST constructed even if there are syntactic errors in the text. You wouldn’t get any of the nice IDE features as soon as there is one little error, if we hadn’t error recovery.

However, LL also has some drawbacks. The most important is, that it does not allow left recursive grammars. For instance, the following is not allowed in LL based grammars, because “Expression '+' Expression” is left recursive:

Expression :
  Expression '+' Expression |
  '(' Expression ')'
  INT;

Instead one has to rewrite such things by “left-factoring” it:

Expression :
  TerminalExpression ('+' TerminalExpression)?;
 
TerminalExpression :
  '(' Expression ')' |
   INT 

In practice this is always the same pattern and therefore not that problematic. However, by simply applying the Xtext AST construction features we’ve covered so far, a grammar ...

Expression :
  {Operation} left=TerminalExpression (op='+' right=TerminalExpression)?;
 
TerminalExpression returns Expression:
  '(' Expression ')' |
  {IntLiteral} value=INT;

... would result in unwanted elements in the AST. For instance the expression “ ( 42 ) ” would result in a tree like this:

Operation {
  left=Operation {
    left=IntLiteral {
      value=42
    }
  }
}

Typically one would only want to have one instance of IntLiteral instead.

One can solve this problem using a combination of unassigned rule calls and actions:

Expression :
  TerminalExpression ({Operation.left=current} 
    op='+' right=TerminalExpression)?;
 
TerminalExpression returns Expression:
  '(' Expression ')' |
  {IntLiteral} value=INT;

In the example above {Operation.left=current} is a so called tree rewrite action, which creates a new instance of the stated EClass (Operation in this case) and assigns the element currently to-be-returned ( current variable) to a feature of the newly created Object (in this case ‘left’). In Java these semantics could be expressed as:

Operation temp = new Operation();
temp.setLeft(current);
current = temp;

Hidden terminal symbols

Because parser rules describe not a single token, but a sequence of patterns in the input, it is necessary to define the interesting parts of the input. Xtext introduces the concept of hidden tokens to handle semantically unimportant things like whitespaces, comments, etc. in the input sequence gracefully. It is possible to define a set of terminal symbols, that are hidden from the parser rules and automatically skipped when they are recognized. Nevertheless, they are transparently woven into the node model, but not relevant for the semantic model.

Hidden terminals may (or may not) appear between any other terminals in any cardinality. They can be described per rule or for the whole grammar. When reusing a single grammar its definition of hidden tokens is reused as well. The grammar org.eclipse.xtext.common.Terminals comes with a reasonable default and hides all comments and whitespace from the parser rules.

If a rule defines hidden symbols, you can think of a kind of scope that is automatically introduced. Any rule that is called from the declaring rule uses the same hidden terminals as the calling rule, unless it defines other hidden tokens itself.


 Person hidden(WS, ML_COMMENT, SL_COMMENT): 
   name=fullname age=INT ';';

 Fullname: 
   (firstname=ID)? lastname=ID;

The sample rule “Person” defines multiple-line comments (ML_COMMENT), single-line comments (SL_COMMENT), and whitespaces (WS) to be allowed between the ‘Fullname’ and the ‘age’. Because ‘Fullname’ does not introduce another set of hidden terminals, it allows the same symbols to appear between ‘firstname’ and ‘lastname’ as the calling rule ‘Person’. Thus, the following input is perfectly valid for the given grammar snippet:


 John /* comment */ Smith // line comment
   /* comment */
   42      ;

A list of all default terminals like WS can be found in section Grammar Mixins.

Datatype rules

Datatype rules are parsing-phase rules, which create instances of EDataType as terminal rules do. The nice thing about datatype rules is that they are actually parser rules and are therefore

  1. context sensitive and

  2. allow for use of hidden tokens

If you, for instance, want to define a rule to consume Java-like qualified names (e.g. “foo.bar.Baz”) you could write:


 QualifiedName :
   ID ('.' ID)*;

Which looks similar to the terminal rule we’ve defined above in order to explain rule calls. However, the difference is that because it is a parser rule and therefore only valid in certain contexts, it won’t conflict with the rule ID. If you had defined it as a terminal rule, it would hide the ID rule.

In addition having this defined as a datatype rule, it is allowed to use hidden tokens (e.g. "/* comment /") between the IDs and dots (e.g. @foo/ comment */. bar . Baz"@)

Return types can be specified like in terminal rules:

QualifiedName returns ecore::EString : ID ('.' ID)*;

Note that if a rule does not call another parser rule and does not contain any actions nor assignments (see parser rules), it is considered a datatype rule and the datatype EString is implied if not explicitly declared differently.

For conversion again value converters are responsible (cf. value converters).

Enum Rules

Enum rules return enumeration literals from strings. They can be seen as a shortcut for datatype rules with specific value converters. The main advantage of enum rules is their simplicity, typesafety and therefore nice validation. Furthermore it is possible to infer enums and their respective literals during the metamodel transformation.

If you want to define a ChangeKind org.eclipse.emf.ecore.change/model/Change.ecore with ‘ADD’, ‘MOVE’ and ‘REMOVE’ you could write:


 enum ChangeKind :
   ADD | MOVE | REMOVE;

It is even possible to use alternative literals for your enums or reference an enum value twice:


 enum ChangeKind :
   ADD = 'add' | ADD = '+' | 
   MOVE = 'move' | MOVE = '->' | 
   REMOVE = 'remove' | REMOVE = '-';

Please note, that Ecore does not support unset values for enums. If you formulate a grammar like


 Element: "element" name=ID (value=SomeEnum)?;

with the input of


 element Foo

the resulting value of the element Foo will hold the enum value with the internal representation of 0. When generating the EPackage from your grammar this will be the first literal you define. As a workaround you could introduce a dedicated none-value or order the enums accordingly. Note that it is not possible to define an enum literal with an empty textual representation.


 enum Visibility: package | private | protected | public;