Serialization is the process of transforming an EMF model to its textual representation. Thereby, serialization complements parsing and lexing.
In Xtext, the process of serialization is split into three steps:
Matching the model elements with the grammar rules and creating a stream of tokens. This is done by the Parse Tree Constructor.
Mixing existing hidden tokens (whitespace, comments, etc.) into the token stream. This is done by the Hidden Token Merger.
Adding needed whitespace or replacing all whitespace using a Formatter.
Serialization is invoked when calling XtextResource.save(...). Furthermore, SerializerUtil provides resource-independent support for serialization.
The contract of serialization is that when a model is serialized to its textual representation and then loaded (parsed) again, the loaded model equals the original model. Please be aware that this does not imply that loading a textual representation and serializing it again yields the same text. For example, the two texts can differ when a default value is written out in the textual representation and the assignment is optional, because the serializer may omit it. Another example is:
MyRule:
(xval+=ID | yval+=INT)*;
MyRule in this example reads ID- and INT-elements which may occur in an arbitrary order in the textual representation. However, when serializing the model, all ID-elements will be written first and then all INT-elements. If the order is important, it can be preserved by storing all elements in the same list, which may require wrapping the ID- and INT-elements into objects.
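The loss of ordering can be illustrated with a small self-contained sketch. It does not use Xtext: MyRuleModel, parse and serialize are hypothetical stand-ins that only mimic what the grammar above does, assuming whitespace-separated tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the EMF object created by MyRule; not an Xtext class.
class MyRuleModel {
    final List<String> xval = new ArrayList<>();
    final List<Integer> yval = new ArrayList<>();

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MyRuleModel)) {
            return false;
        }
        MyRuleModel other = (MyRuleModel) o;
        return xval.equals(other.xval) && yval.equals(other.yval);
    }

    @Override
    public int hashCode() {
        return 31 * xval.hashCode() + yval.hashCode();
    }
}

class RoundTripSketch {
    // Toy "parser" for (xval+=ID | yval+=INT)*: digits go to yval, everything else to xval.
    static MyRuleModel parse(String text) {
        MyRuleModel model = new MyRuleModel();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.matches("[0-9]+")) {
                model.yval.add(Integer.parseInt(token));
            } else {
                model.xval.add(token);
            }
        }
        return model;
    }

    // Toy "serializer": writes one feature after the other, so the interleaving is lost.
    static String serialize(MyRuleModel model) {
        List<String> out = new ArrayList<>(model.xval);
        for (Integer value : model.yval) {
            out.add(String.valueOf(value));
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        MyRuleModel model = parse("a 1 b 2");
        String text = serialize(model);
        System.out.println(text);                      // prints "a b 1 2"
        System.out.println(parse(text).equals(model)); // prints "true"
    }
}
```

The reloaded model equals the original one, so the contract holds, even though the text changed from "a 1 b 2" to "a b 1 2".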
The Parse Tree Constructor usually doesn’t need to be customized since it is automatically derived from the Xtext Grammar. However, it can be a good idea to look into it to understand its error messages and its runtime performance.
For serialization to succeed, the Parse Tree Constructor must be able to consume every element of the to-be-serialized EMF model. To consume means, in this context, to write it to the textual representation of the model. This requirement can be hard to fulfill, since a grammar usually introduces implicit constraints on the Ecore model. Example:
MyRule:
(sval+=ID ival+=INT)*;
This example introduces the constraint sval.size() == ival.size(). Models which violate this constraint are valid EMF models, but they cannot be serialized. Currently, the only way to check whether a model complies with all constraints introduced by the grammar is to invoke the Parse Tree Constructor. If this changes, there will be news in bugzilla 239565.
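To make the implicit constraint concrete, here is a minimal sketch of what serializing (sval+=ID ival+=INT)* amounts to. The class PairConstraintSketch is hypothetical and unrelated to the actual Parse Tree Constructor; it only shows why unequal list sizes leave nothing valid to write.

```java
import java.util.Arrays;
import java.util.List;

class PairConstraintSketch {
    // Emits "sval ival" pairs; fails if one list has elements that could never be consumed.
    static String serialize(List<String> sval, List<Integer> ival) {
        if (sval.size() != ival.size()) {
            throw new IllegalStateException("cannot serialize: sval.size()=" + sval.size()
                    + " but ival.size()=" + ival.size());
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < sval.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(sval.get(i)).append(' ').append(ival.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(serialize(Arrays.asList("a", "b"), Arrays.asList(1, 2))); // prints "a 1 b 2"
        try {
            serialize(Arrays.asList("a", "b"), Arrays.asList(1));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```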
For the Parse Tree Constructor, this can lead to the following scenarios:
A model element cannot be consumed. This can have the following reasons/solutions:
The model element should not be stored in the model.
The grammar needs an assignment which would consume the model element.
The Transient Value service could be used to indicate that this model element should not be consumed.
An assignment in the grammar has no corresponding model element. The Parse Tree Constructor considers a model element not to be present if it is unset or equals its default value. However, the Parse Tree Constructor may serialize default values if a grammar constraint requires this in order to serialize another model element. This can have the following reasons/solutions:
A model element is missing in the model.
The assignment in the grammar should be made optional.
To understand error messages and performance issues of the Parse Tree Constructor, it is important to know that it implements a backtracking approach. This basically means that the grammar is used to specify the structure of a tree in which one path (from the root node to a leaf node) is a valid serialization of a specific model. The Parse Tree Constructor's task is to find this path, under the condition that all model elements are consumed while walking this path. The Parse Tree Constructor's strategy is to take the most promising branch first (the one that would consume the most model elements). If the branch leads to a dead end (for example, if a model element needs to be consumed that is not present in the model), the Parse Tree Constructor backtracks along the path until a different branch can be taken. This behavior has two consequences:
In case of an error, the Parse Tree Constructor has found only dead ends but no leaf. It cannot tell which dead end is actually erroneous. Therefore, the error message lists the dead ends of the longest paths, a fragment of their serialization and the reason why the path could not be continued at this point. The developer has to judge which reason is the actual error.
For reasons of performance, it is critical that the Parse Tree Constructor takes the right branch first and detects wrong branches early. One way to achieve this is to avoid having many rules which return the same type and which are called from within the same grammar alternative.
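The backtracking idea can be sketched in a few lines. This is not the Parse Tree Constructor's implementation: Node, walk and the single-valued feature map are hypothetical simplifications, and branches are tried in declaration order rather than "most promising first". The example grammar tree corresponds roughly to 'rule' (x=ID y=ID | x=ID) ';'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BacktrackingSketch {
    // Hypothetical grammar node: either emits a keyword or consumes the value of one feature.
    static final class Node {
        final String keyword;  // text to emit, or null for an assignment
        final String feature;  // feature to consume, or null for a keyword
        final List<Node> next; // alternative continuations; empty means leaf

        private Node(String keyword, String feature, Node... next) {
            this.keyword = keyword;
            this.feature = feature;
            this.next = Arrays.asList(next);
        }

        static Node kw(String text, Node... next) {
            return new Node(text, null, next);
        }

        static Node assign(String feature, Node... next) {
            return new Node(null, feature, next);
        }
    }

    // Returns true if a path from this node to a leaf consumes all remaining values.
    static boolean walk(Node node, Map<String, String> remaining, List<String> out) {
        String token = node.keyword;
        if (node.feature != null) {
            token = remaining.remove(node.feature);
            if (token == null) {
                return false; // dead end: the model has no value for this assignment
            }
        }
        out.add(token);
        if (node.next.isEmpty()) {
            if (remaining.isEmpty()) {
                return true; // leaf reached and every model value was consumed
            }
        } else {
            for (Node alternative : node.next) {
                if (walk(alternative, remaining, out)) {
                    return true;
                }
            }
        }
        // Backtrack: undo this node's output and hand the value back to the model.
        out.remove(out.size() - 1);
        if (node.feature != null) {
            remaining.put(node.feature, token);
        }
        return false;
    }

    // Returns the serialized text, or null if only dead ends were found.
    static String serialize(Node root, Map<String, String> model) {
        List<String> out = new ArrayList<>();
        Map<String, String> remaining = new HashMap<>(model);
        return walk(root, remaining, out) ? String.join(" ", out) : null;
    }

    static Node exampleGrammar() {
        return Node.kw("rule",
                Node.assign("x", Node.assign("y", Node.kw(";"))),
                Node.assign("x", Node.kw(";")));
    }

    public static void main(String[] args) {
        Map<String, String> model = new HashMap<>();
        model.put("x", "foo");
        System.out.println(serialize(exampleGrammar(), model)); // prints "rule foo ;"
        model.put("z", "bar"); // no assignment in the grammar consumes z
        System.out.println(serialize(exampleGrammar(), model)); // prints "null"
    }
}
```

In the first call the sketch enters the first alternative, consumes x, hits a dead end at y, undoes the consumption and succeeds with the second alternative. In the second call every path ends with z unconsumed, which corresponds to the "only dead ends but no leaf" error case.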
Transient Values are values or model elements which are not persisted (written to the textual representation in the serialization phase). If a model contains model elements which cannot be serialized with the current grammar, it is critical to mark them transient using the ITransientValueService, or serialization will fail. The default implementation marks as transient all model elements that are unset or equal their default value.
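The default rule can be paraphrased in plain Java. The sketch below does not implement ITransientValueService, whose actual signature is not shown here; isTransient and the map-based model are hypothetical and only mirror the "unset or equal to the default" rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TransientValueSketch {
    // Hypothetical check: a value is transient if it is unset (null here) or equals its default.
    static boolean isTransient(Object value, Object defaultValue) {
        return value == null || value.equals(defaultValue);
    }

    // Writes "name=value" for every feature that is not transient.
    static String serialize(Map<String, Object> values, Map<String, Object> defaults) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Object> entry : values.entrySet()) {
            if (isTransient(entry.getValue(), defaults.get(entry.getKey()))) {
                continue; // skipped: nothing is written for transient values
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(entry.getKey()).append('=').append(entry.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("count", 0);

        Map<String, Object> values = new LinkedHashMap<>();
        values.put("name", "a");
        values.put("count", 0);    // equals its default
        values.put("label", null); // unset

        System.out.println(serialize(values, defaults)); // prints "name=a"
    }
}
```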
Unassigned Text are data rule calls or terminal rule calls which do not reside within an assignment. Example:
PluralRule:
'contents:' count=INT Plural;
terminal Plural:
'item' | 'items';
Valid DSL scripts for this example are contents: 1 item or contents: 5 items. However, it is not stored in the semantic model whether the keyword item or items has been parsed. This is due to the fact that the rule call Plural is unassigned. However, the Parse Tree Constructor needs a decision which value to write during serialization. This decision can be made by implementing the IUnassignedTextSerializer.
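The decision itself is usually simple once it is derived from the rest of the model. The following sketch shows one plausible rule for the PluralRule example. It is not an implementation of IUnassignedTextSerializer, whose signature is not shown here; pluralFor and serialize are hypothetical helper names.

```java
class PluralTextSketch {
    // Chooses the text for the unassigned Plural rule call from the value of count.
    static String pluralFor(int count) {
        return count == 1 ? "item" : "items";
    }

    // Writes PluralRule as: 'contents:' count=INT Plural
    static String serialize(int count) {
        return "contents: " + count + " " + pluralFor(count);
    }

    public static void main(String[] args) {
        System.out.println(serialize(1)); // prints "contents: 1 item"
        System.out.println(serialize(5)); // prints "contents: 5 items"
    }
}
```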
The Cross Reference Serializer specifies which values are to be written to the textual representation for cross references. This behavior can be customized by implementing ICrossReferenceSerializer. The default implementation delegates to ILinkingService, which may be the better place for customization.
After the Parse Tree Constructor has done its job of creating a stream of tokens which are to be written to the textual representation, the Hidden Token Merger (IHiddenTokenMerger) mixes existing hidden tokens into this token stream. The default implementation uses the hidden tokens (whitespace, line breaks, comments) from the node model. The IHiddenTokenMerger is the factory for a Token Stream which is fed by the Parse Tree Constructor and which writes to another Token Stream.
The Parse Tree Constructor, the Hidden Token Merger and the Formatter use Token Streams for their output, and the latter two for their input as well. This makes them chainable. Token Streams can be converted to String using the TokenStringBuffer and to java.io.OutputStream using the TokenOutputStream. An implementation that reconstructs a node model may follow at some point in the future. While providing fast output due to the stream pattern, Token Streams allow easy manipulation of the stream, such as mixing in whitespace or modifying it.
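import java.io.IOException;
import org.eclipse.emf.ecore.EObject;

public interface ITokenStream {
    public void close() throws IOException;
    // Writes a hidden token (whitespace, comment) for the given grammar element.
    public void writeHidden(EObject grammarElement, String value) throws IOException;
    // Writes a semantic token for the given grammar element.
    public void writeSemantic(EObject grammarElement, String value) throws IOException;
}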
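To illustrate how such streams chain, here is a self-contained sketch. It uses a simplified copy of the interface (SketchTokenStream, with Object in place of EObject so that it compiles without EMF). StringSink and SingleSpaceFilter are hypothetical classes written for this example, not Xtext's TokenStringBuffer or Formatter. Note that the filter drops all incoming hidden tokens, including any comments.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Simplified stand-in for ITokenStream: Object replaces EObject for this sketch only.
interface SketchTokenStream {
    void close() throws IOException;
    void writeHidden(Object grammarElement, String value) throws IOException;
    void writeSemantic(Object grammarElement, String value) throws IOException;
}

// End of the chain: collects every token it receives into a String.
class StringSink implements SketchTokenStream {
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void close() {
    }

    @Override
    public void writeHidden(Object grammarElement, String value) {
        buffer.append(value);
    }

    @Override
    public void writeSemantic(Object grammarElement, String value) {
        buffer.append(value);
    }

    @Override
    public String toString() {
        return buffer.toString();
    }
}

// Middle of the chain: discards incoming hidden tokens and writes one space between semantic tokens.
class SingleSpaceFilter implements SketchTokenStream {
    private final SketchTokenStream out;
    private boolean first = true;

    SingleSpaceFilter(SketchTokenStream out) {
        this.out = out;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    @Override
    public void writeHidden(Object grammarElement, String value) {
        // Existing hidden tokens are replaced, so nothing is forwarded here.
    }

    @Override
    public void writeSemantic(Object grammarElement, String value) throws IOException {
        if (!first) {
            out.writeHidden(null, " ");
        }
        first = false;
        out.writeSemantic(grammarElement, value);
    }
}

class TokenChainSketch {
    // Feeds a few tokens through filter -> sink and returns the resulting text.
    static String demo() {
        StringSink sink = new StringSink();
        SketchTokenStream stream = new SingleSpaceFilter(sink);
        try {
            stream.writeSemantic(null, "contents:");
            stream.writeHidden(null, "\n\n   ");
            stream.writeSemantic(null, "5");
            stream.writeHidden(null, "\t");
            stream.writeSemantic(null, "items");
            stream.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "contents: 5 items"
    }
}
```

Because the filter both implements the stream interface and writes to one, further stages could be inserted between producer and sink without changing either end.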