In the following the different concepts of the grammar language are explained.
The first line
grammar my.pack.SecretCompartments with org.eclipse.xtext.common.Terminals
declares the name of the grammar. Xtext leverages Java’s class path mechanism. This means that the name can be any valid Java qualifier. The file name needs to correspond to the grammar name and have the file extension ‘xtext’. This means that the name has to be
SecretCompartments.xtext
and must be placed in a package
my.pack
somewhere on your project’s class path.
The first line is also used to declare any used language (for mechanism details see Grammar Mixins).
Xtext parsers instantiate Ecore models (aka meta model). An Ecore model basically consists of an EPackage containing EClasses, EDatatypes and EEnums. Xtext can infer Ecore models from a grammar (see Ecore model inference) but it is also possible to instantiate existing Ecore models. You can even mix this and use multiple existing Ecore models and infer some others from one grammar.
The easiest way to get started is to let Xtext infer the Ecore model from your grammar. This is what is done in the secret compartment example. To do so just state:
generate secretcompartment 'http://www.eclipse.org/secretcompartment'
That statement means: generate an EPackage with the name
secretcompartment
and the nsURI
http://www.eclipse.org/secretcompartment
. Actually these are the properties that are required to create an EPackage.
The whole algorithm used to derive complete Ecore models from Xtext grammars is described in the section
Ecore model inference.
If you already have an existing EPackage, you can import it using either a namespace URI or a resource URI. An URI (Uniform Resource Identifier) provides a simple and extensible means for identifying an abstract or physical resource. For details about the EMF implementation of this concept, please see its documentation.
You can import an existing EPackage with the following statement:
import 'http://www.eclipse.org/secretcompartment'
Note that if you use a namespace URI, the corresponding EPackage needs to be installed into the workbench. Otherwise the editor cannot find it.
At runtime (i.e. when starting the generator) you need to make sure that the corresponding EPackage is registered in the
EPackage.Registry.INSTANCE
. If you use MWE to drive your code generator, you need to add the following lines to your workflow file:
<bean class="org.eclipse.emf.mwe.utils.StandaloneSetup"
platformUri="${runtimeProject}/..">
<registerGeneratedEPackage value="my.pack.SecretcompartmentPackage"/>
</bean>
Using namespace URIs is typically only interesting when common Ecore models are used, such as Ecore itself or the UML meta model. If you’re developing the EPackage together with the DSL but don’t want to have it derived from the grammar for some reason, we suggest to use a resource URI.
If the EPackage you want to use is somewhere in your workspace you should refer to it by a
platform:/resource/
-URI.
Platform URIs are a special EMF concept, which allow for some kind of transparency between workspace projects and installed bundles. Consult the EMF documentation (we recommend the book) for details.
An import statement referring to an Ecore file by a platform:/resource/ URI looks like this:
import 'platform:/resource/project/src/my/pack/SecretCompartments.ecore'
If you want to mix generated and imported Ecore models you’ll have to configure the generator fragment in your MWE file responsible for generating the Ecore classes ( EcoreGeneratorFragment) with resource URIs that point to the genmodels for the referenced Ecore models. Example:
<fragment class="org.eclipse.xtext.generator.ecore.EcoreGeneratorFragment"
genModels=
"platform:/resource/project/src/my/pack/SecretCompartments.genmodel"/>
If you want to use multiple EPackages you need to specify aliases in the following way:
generate secretcompartment 'http://www.eclipse.org/secretcompartment'
import 'http://www.eclipse.org/anotherPackage' as another
When referring to a type somewhere in the grammar you need to qualify the reference using that alias (example
another::CoolType
). We’ll see later where such type references occur.
It is also supported to put multiple EPackage imports into one alias. This is no problem as long as there are not any two EClassifiers with the same name. In such cases none of them can be referenced. It is even possible to ‘import’ multiple and ‘generate’ one Ecore model and all of them are declared for the same alias. If you do so, for a reference to an EClassifier first the imported EPackages are scanned before it is assumed that a type needs to by generated into the to-be-generated package.
Example:
generate toBeGenerated 'http://www.eclipse.org/toBeGenerated'
import 'http://www.eclipse.org/packContainingClassA'
import 'http://www.eclipse.org/packContainingClassB'
With the declaration above
a reference to type
ClassA
would be linked to the EClass contained in
http://www.eclipse.org/packContainingClassA
,
a reference to type
ClassB
would be linked to the EClass contained in
http://www.eclipse.org/packContainingClassB
,
a reference to type
NotYetDefined
would be linked to a newly created EClass in
http://www.eclipse.org/toBeGenerated
.
Note, that using this feature is not recommended, because it might cause problems, which are hard to track down.
For instance, a reference to
classA
would as well be linked to a newly created EClass, because the corresponding type in
http://www.eclipse.org/packContainingClassA
is spelled with a capital letter.
The default parsing is based on a home-grown packrat parser. It is advised to substitute it by an ANTLR parser through the Xtext service mechanism. ANTLR is a sophisticated parser generator framework based on an LL(*)
parsing algorithm, that works quite well for Xtext. Please download the plugin de.itemis.xtext.antlr (from update site
http://download.itemis.com/updates/
) and use the ANTLR Parser instead of the packrat parser (cf.
Xtext Workspace Setup).
Basically parsing can be separated in the following phases.
lexing
parsing
linking
validation
In the first stage, i.e. lexing, a sequence of characters (the text input) is transformed into a sequence of so called tokens. In this context, a token is sort of a strongly typed part of the input sequence. It consists of one or more characters and was matched by a particular terminal rule or keyword and therefore represents an atomic symbol. Terminal rules are also referred to as token rules or lexer rules. There is an informal naming convention that names of terminal rules are all upper-case.
In the secret compartments example there are no explicitly defined terminal rules, since it only uses the ID rule which is inherited from the grammar
org.eclipse.xtext.common.Terminals
(cf.
Grammar Mixins).
Therein the ID rule is defined as follows:
terminal ID :
('^')?('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
It says that a Token ID starts with an optional ‘^’ character (caret), followed by a letter (‘a’..‘z’|‘A’..‘Z’) or underscore (‘_’) followed by any number of letters, underscores and numbers (‘0’..‘9’).
The caret is used to escape an identifier for cases where there are conflicts with keywords. It is removed by the ID rule’s ValueConverter.
This is the formal definition of terminal rules:
TerminalRule :
'terminal' name=ID ('returns' type=TypeRef)? ':'
alternatives=TerminalAlternatives ';'
;
Note, that the order of terminal rules is crucial for your grammar, as they may hide each other. This is especially important for newly introduced rules in connection with mixed rules from used grammars.
If you for instance want to add a rule to allow fully qualified names in addition to simple IDs, you should implement it as a data type rule, instead of adding another terminal rule.
A terminal rule returns a value, which is a string (type
ecore::EString
) by default.
However, if you want to have a different type you can specify it. For instance, the rule ‘INT’ is defined as:
terminal INT returns ecore::EInt :
('0'..'9')+;
This means that the terminal rule ‘INT’ returns instances of
ecore::EInt
. It is possible to define any kind of data type here, which just needs to be an instance of
ecore::EDataType
.
In order to tell the parser how to convert the parsed string to a value of the declared data type, you need to provide your own implementation of
IValueConverterService (cf.
value converters). The value converter is also the point where you can remove things like quotes from string literals or the caret (‘^’) from identifiers. Its implementation needs to be registered as a service (cf.
Service Framework).
Token rules are described using “Extended Backus-Naur Form”-like (EBNF) expressions. The different expressions are described in the following. The one thing all of these expressions have in common is the cardinality operator. There are four different possible cardinalities
exactly one (the default, no operator)
one or none (operator “?”)
any (zero or more, operator “*”)
one or more (operator “+”)
Keywords are a kind of token rule literals.
The ‘ID’ rule in
org.eclipse.xtext.common.Terminals
for instance starts with a keyword:
terminal ID : '^'? .... ;
The question mark sets the cardinality to “none or one” (i.e. optional) like explained above.
Note that a keyword can have any length and contain arbitrary characters.
A character range can be declared using the ‘..’ operator.
Example:
terminal INT returns ecore::EInt: ('0'..'9')+;
In this case an INT is comprised of one or more (note the ‘+’ operator) characters between (and including) ‘0’ and ‘9’.
If you want to allow any character you can simple write the wildcard operator ‘.’ (dot): Example:
FOO : 'f' . 'o';
The rule above would allow expressions like ‘foo’, ‘f0o’ or even ‘f\no’.
With the until token it is possible to state that everything should be consumed until a certain token occurs. The multiline comment is implemented this way:
terminal ML_COMMENT : '/*' -> '*/';
This is the rule for Java-style comments that begin with ‘/*’ and end with ‘*/’.
All the tokens explained above can be inverted using a preceding explanation mark:
terminal ML_COMMENT : '/*' (!'*/')+;
Rules can refer to other rules. This is done by writing the name of the rule to be called. We refer to this as rule calls. Rule calls in terminal rules can only point to terminal rules.
Example:
terminal QUALIFIED_NAME : ID ('.' ID)*;
Using alternatives one can state multiple different alternatives. For instance, the whitespace rule uses alternatives like this:
terminal WS : (' '|'\t'|'\r'|'\n')+;
That is a WS can be made of one or more whitespace characters (including ‘ ’,‘\t’,‘\r’,‘\n’).
The parser reads a sequence of terminals and walks through the parser rules. Hence a parser rule – contrary to a terminal rule – does not produce a single terminal token but a tree of non-terminal and terminal tokens. They lead to a so called parse tree (in Xtext it is also referred as node model). Furthermore, parser rules are handled as kind of a building plan for the creation of the EObjects that form the semantic model (the linked abstract syntax graph or AST). Due to this fact, parser rules are even called production rules. The different constructs like actions and assignments are used to derive types and initialize the semantic objects accordingly.
Not all the expressions that are available in terminal rules can be used in parser rules. Character ranges, wildcards, the until token and the negation are only available for terminal rules.
The elements that are available in parser rules as well as in terminal rules are
keywords and
In addition to these elements, there are some expressions used to direct how the AST is constructed, which are listed and explained in the following.
Assignments are used to assign the parsed information to a feature of the current object. The type of the current object, its EClass, is specified by the return type of the parser rule. If it is not explicitly stated it is implied that the type’s name equals the rule’s name. The type of the feature is infered from the right hand side of the assignment.
Example:
State :
'state' name=ID
('actions' '{' (actions+=[Command])+ '}')?
(transitions+=Transition)*
'end'
;
The syntactic declaration for states in the state machine example starts with a keyword ‘state’ followed by an assignment:
name=ID
The left hand side refers to a feature '
name
' of the current object (which has the EClass ‘State’ in this case).
The right hand side can be a rule call, a keyword, a cross reference (explained later) or even an alternative comprised by the former.
The type of the feature needs to be compatible with the type of the expression on the right. As ID returns an EString in this case, the feature '
name
' needs to be of type EString as well.
Assignment Operators
There are three different assignment operators, each with different semantics.
The simple equal sign “=” is the straight forward assignment, and used for features which take only one element.
The “+=” sign (the add operator) expects a multi valued feature and adds the value on the right hand to that feature, which is a list feature.
The “?=” sign (boolean assignment operator) expects a feature of type EBoolean and sets it to true if the right hand side was consumed independently from the concrete value of the right hand side.
The used assignment operator does not effect the cardinality of the expected symbols on the right hand side.
A unique feature of Xtext is the ability to declare crosslinks in the grammar. In traditional compiler construction the crosslinks are not established during parsing but in a later linking phase. This is the same in Xtext, but we allow to specify crosslink information in the grammar. This information is used by the linker. The syntax for crosslinks is:
CrossReference :
'[' type=TypeRef ('|' ^terminal=CrossReferenceableTerminal )? ']'
;
For example, the transition is made up of two cross references, pointing to an event and a state:
Transition :
event=[Event] '=>' state=[State]
;
It is important to understand that the text between the square brackets does not refer to another rule, but to a type! This is sometimes confusing, because one usually uses the same name for the rules and the returned types. That is if we had named the type for events differently like in the following the cross reference needs to be adapted as well:
Transition :
event=[MyEvent] '=>' state=[State]
;
Event returns MyEvent : ....;
Looking at the syntax definition of cross references, there is an optional part starting with a vertical bar (pipe) followed by ‘CrossReferenceableTerminal’. This is the part describing the concrete text, from which the crosslink later should be established. If the terminal is omitted, it is expected to be an ID.
You may even use alternatives as the referencable terminal. This way, either an ID or a STRING may be used as the referencable terminal, as it is possible in many SQL dialects.
TableRef: table=[Table|(ID|STRING)];
Have a look at the linking section in order to understand how linking is done.
By default the object to be returned by a parser rule is created lazily on the first assignment. Then the type of the EObject to be created is determined from the specified return type or the rule name if no explicit return type is specified. With Actions however, the creation of returned EObject can be made explicit. Xtext supports two kinds of Actions:
simple actions, and
assigned actions.
If at some point you want to enforce the creation of a specific type you can use alternatives or simple actions.
In the following example TypeB must be a subtype of TypeA. An expression
A ident
should create an instance of TypeA, whereas
B ident
should instantiate TypeB.
Example with alternatives:
MyRule returns TypeA :
"A" name=ID |
MyOtherRule
;
MyOtherRule returns TypeB :
"B" name = ID
;
Example with simple actions:
MyRule returns TypeA :
"A" name=ID |
"B" {TypeB} name=ID
;
Generally speaking, the instance is created as soon as the parser hits the first assignment. However, actions allow to explicitly instantiate any EObject. The notation
{TypeB}
will create an instance of TypeB and assign it to the result of the parser rule. This allows parser rules without any assignment and object creation without the need to introduce unnecessary rules.
We previously explained, that the EObject to be returned is created lazily when the first assignment occurs or when a simple action is evaluated. There is another way one can set the EObject to be returned, which we call an “unassigned rule call”.
Unassigned rule calls (the name suggests it) are rule calls to other parser rules, which are not used within an assignment. If there is no feature the returned value shall be assigned to, the value is assigned to the “to-be-returned” result of the calling rule.
With unassigned rule calls one can, for instance, create rules which just dispatch between several other rules:
AbstractToken :
TokenA |
TokenB |
TokenC
;
As
AbstractToken
could possibly return an instance of
TokenA
,
TokenB
or
TokenC
its type must by a super type of these types.
It is now for instance as well possible to further change the state of the AST element by assigning additional things.
Example:
AbstractToken :
( TokenA |
TokenB |
TokenC ) (cardinality=('?'|'+'|'*'))?
;
This way the ‘cardinality’ is optional (last question mark) and can be represented by a question mark, a plus, or an asterisk. It will be assigned to either an EObject of type
TokenA
,
TokenB
, or
TokenC
which are all subtypes of
AbstractToken
. The rule in this example will never create an instance of
AbstractToken
directly.
LL parsing has some significant advantages over LR algorithms. The most important ones for Xtext are, that the generated code is much simpler to understand and debug and that it is easier to recover from errors. Especially ANTLR has a very nice generic error recovery mechanism. This allows to construct an AST even if there are syntactic errors in the text. You wouldn’t get any of the nice IDE features as soon as there is one little error, if we hadn’t error recovery.
However, LL also has some drawbacks. The most important one is that it does not allow left recursive grammars. For instance, the following is not allowed in LL based grammars, because “Expression '+' Expression” is left recursive:
Expression :
Expression '+' Expression |
'(' Expression ')'
INT
;
Instead one has to rewrite such things by “left-factoring” it:
Expression :
TerminalExpression ('+' Expression)?
;
TerminalExpression :
'(' Expression ')' |
INT
;
In practice this is always the same pattern and therefore not that problematic. However, by simply applying the Xtext AST construction features we’ve covered so far, a grammar ...
Expression :
{Operation} left=TerminalExpression (op='+' right=Expression)?
;
TerminalExpression returns Expression:
'(' Expression ')' |
{IntLiteral} value=INT
;
... would result in unwanted elements in the AST. For instance the expression “ ( 42 ) ” would result in a tree like this:
Operation {
left=Operation {
left=IntLiteral {
value=42
}
}
}
Typically one would only want to have one instance of
IntLiteral
instead.
One can solve this problem using a combination of unassigned rule calls and assigned actions:
Expression :
TerminalExpression ({Operation.left=current}
op='+' right=Expression)?
;
TerminalExpression returns Expression:
'(' Expression ')' |
{IntLiteral} value=INT
;
In the example above
{Operation.left=current}
is a so called tree rewrite action, which creates a new instance of the stated EClass (Operation in this case) and assigns the element currently to-be-returned (
current
variable) to a feature of the newly created object (in this case feature ‘left’ of the Operation instance).
In Java these semantics could be expressed as:
Operation temp = new Operation();
temp.setLeft(current);
current = temp;
Because parser rules describe not a single token, but a sequence of patterns in the input, it is necessary to define the interesting parts of the input. Xtext introduces the concept of hidden tokens to handle semantically unimportant things like whitespaces, comments, etc. in the input sequence gracefully. It is possible to define a set of terminal symbols, that are hidden from the parser rules and automatically skipped when they are recognized. Nevertheless, they are transparently woven into the node model, but not relevant for the semantic model.
Hidden terminals may (or may not) appear between any other terminals in any cardinality. They can be described per rule or for the whole grammar. When
reusing a single grammar its definition of hidden tokens is reused as well. The grammar
org.eclipse.xtext.common.Terminals
comes with a reasonable default and hides all comments and whitespace from the parser rules.
If a rule defines hidden symbols, you can think of a kind of scope that is automatically introduced. Any rule that is called from the declaring rule uses the same hidden terminals as the calling rule, unless it defines other hidden tokens itself.
Person hidden(WS, ML_COMMENT, SL_COMMENT):
name=Fullname age=INT ';'
;
Fullname:
(firstname=ID)? lastname=ID
;
The sample rule “Person” defines multiline comments (ML_COMMENT), single-line comments (SL_COMMENT), and whitespace (WS) to be allowed between the ‘Fullname’ and the ‘age’. Because the rule ‘Fullname’ does not introduce another set of hidden terminals, it allows the same symbols to appear between ‘firstname’ and ‘lastname’ as the calling rule ‘Person’. Thus, the following input is perfectly valid for the given grammar snippet:
John /* comment */ Smith // line comment
/* comment */
42 ; // line comment
A list of all default terminals like WS can be found in section Grammar Mixins.
Data type rules are parsing-phase rules, which create instances of EDataType instead of EClass. Thinking about it, one may discover that they are quite similar to terminal rules. However, the nice thing about data type rules is that they are actually parser rules and are therefore
context sensitive and
allow for use of hidden tokens.
If you, for instance, want to define a rule to consume Java-like qualified names (e.g. “foo.bar.Baz”) you could write:
QualifiedName :
ID ('.' ID)*
;
This looks similar to the terminal rule we’ve defined above in order to explain rule calls. However, the difference is that because it is a parser rule and therefore only valid in certain contexts, it won’t conflict with the rule ID. If you had defined it as a terminal rule, it would possibly hide the ID rule.
In addition when this has been defined as a data type rule, it is allowed to use hidden tokens (e.g. “/* comment **/”) between the IDs and dots (e.g.
foo/* comment */. bar . Baz"
)
Return types can be specified in the same way as in terminal rules:
QualifiedName returns ecore::EString :
ID ('.' ID)*
;
Note that if a rule does not call another parser rule and does neither contain any actions nor assignments, it is considered to be a data type rule and the data type EString is implied if none has been explicitly declared.
For conversion again value converters are responsible (cf. value converters).
Enum rules return enumeration literals from strings. They can be seen as a shortcut for data type rules with specific value converters. The main advantage of enum rules is their simplicity, type safety and therefore nice validation. Furthermore it is possible to infer enums and their respective literals during the Ecore model transformation.
If you want to define a ChangeKind org.eclipse.emf.ecore.change/model/Change.ecore with ‘ADD’, ‘MOVE’ and ‘REMOVE’ you could write:
enum ChangeKind :
ADD | MOVE | REMOVE
;
It is even possible to use alternative literals for your enums or reference an enum value twice:
enum ChangeKind :
ADD = 'add' | ADD = '+' |
MOVE = 'move' | MOVE = '->' |
REMOVE = 'remove' | REMOVE = '-'
;
Please note, that Ecore does not support unset values for enums. If you formulate a grammar like
Element: "element" name=ID (value=SomeEnum)?;
with the input of
element Foo
the resulting value of the element
Foo
will hold the enum value with the internal representation of ‘0’ (zero). When generating the EPackage from your grammar this will be the first literal you define. As a workaround you could introduce a dedicated none-value or order the enums accordingly. Note that it is not possible to define an enum literal with an empty textual representation.
enum Visibility:
package | private | protected | public
;
You can overcome this by modifying the infered Ecore model through a model to model transformation.