Serialization

Serialization is the process of transforming an EMF model into its textual representation. It is thus the counterpart of lexing and parsing.

In Xtext, the process of serialization is split into three steps:

  1. Matching the model elements with the grammar rules and creating a stream of tokens. This is done by the Parse Tree Constructor.

  2. Mixing existing hidden tokens (whitespaces, comments, etc.) into the token stream. This is done by the Hidden Token Merger.

  3. Adding required whitespaces, or replacing all whitespaces, using a Formatter.

Serialization is invoked when calling XtextResource.save(...). Furthermore, SerializerUtil provides resource-independent support for serialization.

The Contract

The contract of serialization is that when a model is serialized to its textual representation and then loaded (parsed) again, the loaded model equals the original model. Please be aware that the reverse does not hold: loading a textual representation and serializing it again does not necessarily yield the same text. For example, the two texts can differ when the original text spells out a default value for an optional assignment. Another example is:

MyRule:
  (xval+=ID | yval+=INT)*;
  

MyRule in this example reads ID- and INT-elements which may occur in an arbitrary order in the textual representation. However, when serializing the model, all ID-elements will be written first and then all INT-elements. If the order is important, it can be preserved by storing all elements in the same list, which may require wrapping the ID- and INT-elements into objects.
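The following is a minimal sketch of this effect in plain Java. It does not use any Xtext API: MyRuleModel, parse and serialize are hypothetical stand-ins for the model, the parser and the serializer of MyRule, and the "parser" simply treats digit-only tokens as INT and everything else as ID.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the model produced by MyRule:
// one list for the ID values and one for the INT values.
final class MyRuleModel {
  final List<String> xval = new ArrayList<>();
  final List<Integer> yval = new ArrayList<>();

  // Toy "parser": digit-only tokens go to yval, all other tokens to xval.
  static MyRuleModel parse(String text) {
    MyRuleModel model = new MyRuleModel();
    for (String token : text.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue;
      }
      if (token.matches("\\d+")) {
        model.yval.add(Integer.parseInt(token));
      } else {
        model.xval.add(token);
      }
    }
    return model;
  }

  // Toy "serializer": the interleaving of IDs and INTs is not stored in the
  // model, so all IDs are written first, followed by all INTs.
  String serialize() {
    List<String> tokens = new ArrayList<>(xval);
    for (Integer value : yval) {
      tokens.add(String.valueOf(value));
    }
    return String.join(" ", tokens);
  }

  // Two models are considered equal if both lists are equal.
  boolean sameContentAs(MyRuleModel other) {
    return xval.equals(other.xval) && yval.equals(other.yval);
  }
}
```

In this sketch, parsing the text a 1 b 2 and serializing the result yields a b 1 2. The text has changed, but parsing the new text again produces a model with the same content, which is what the contract requires.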

Parse Tree Constructor

The Parse Tree Constructor usually doesn’t need to be customized since it is automatically derived from the Xtext Grammar. However, it can be a good idea to look into it to understand its error messages and its runtime performance.

For serialization to succeed, the Parse Tree Constructor must be able to consume every element of the to-be-serialized EMF model. To consume means, in this context, to write it to the textual representation of the model. This requirement can be hard to fulfill, since a grammar usually introduces implicit constraints on the Ecore model. Example:

MyRule:
  (sval+=ID ival+=INT)*;

This example introduces the constraint sval.size() == ival.size(). Models which violate this constraint are valid EMF models, but they cannot be serialized. Currently, the only way to check whether a model complies with all constraints introduced by the grammar is to invoke the Parse Tree Constructor. Should this change, it will be announced in Bugzilla 239565.
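To make the implicit constraint concrete, here is a small plain-Java sketch. It is not how the Parse Tree Constructor is implemented: GrammarConstraints, isSerializable and serialize are hypothetical names, and the two lists stand in for the sval and ival features of MyRule.

```java
import java.util.List;

// Hypothetical helper illustrating the constraint implied by
// (sval+=ID ival+=INT)*: every ID must be followed by exactly one INT.
final class GrammarConstraints {

  // True if the two lists can be written as a sequence of ID/INT pairs.
  static boolean isSerializable(List<String> sval, List<Integer> ival) {
    return sval.size() == ival.size();
  }

  // Writes the pairs in order and fails if an element would be left over.
  static String serialize(List<String> sval, List<Integer> ival) {
    if (!isSerializable(sval, ival)) {
      throw new IllegalStateException(
          "sval has " + sval.size() + " elements but ival has " + ival.size());
    }
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < sval.size(); i++) {
      if (i > 0) {
        out.append(' ');
      }
      out.append(sval.get(i)).append(' ').append(ival.get(i));
    }
    return out.toString();
  }
}
```

A model with two IDs and one INT is a perfectly valid object graph, yet serialize rejects it in this sketch because one element cannot be consumed.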

For the Parse Tree Constructor, this can lead to the scenarios, that

To understand error messages and performance issues of the Parse Tree Constructor, it is important to know that it implements a backtracking approach. This basically means that the grammar is used to specify the structure of a tree in which one path (from the root node to a leaf node) is a valid serialization of a specific model. The Parse Tree Constructor’s task is to find this path, under the condition that all model elements are consumed while walking it. The Parse Tree Constructor’s strategy is to take the most promising branch first (the one that would consume the most model elements). If the branch leads to a dead end (for example, if a model element needs to be consumed that is not present in the model), the Parse Tree Constructor backtracks along the path until a different branch can be taken. This behavior has two consequences:

Transient Values

Transient Values are values or model elements which are not persisted, that is, not written to the textual representation in the serialization phase. If a model contains model elements which cannot be serialized with the current grammar, it is critical to mark them transient using the ITransientValueService, or serialization will fail. The default implementation marks as transient all model elements that are unset or equal to their default value.
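The default rule described above can be sketched in plain Java as follows. This is an illustration only and does not reproduce the signature of ITransientValueService: Feature and DefaultTransientCheck are hypothetical types, and a model element is reduced to a map from feature names to explicitly set values.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical, simplified feature description: a name plus a default value.
final class Feature {
  final String name;
  final Object defaultValue;

  Feature(String name, Object defaultValue) {
    this.name = name;
    this.defaultValue = defaultValue;
  }
}

// Sketch of the default rule: a value is transient (not written) when it is
// unset or when it equals the default value of its feature.
final class DefaultTransientCheck {

  // 'values' holds only the features that were explicitly set on the element.
  static boolean isTransient(Map<String, Object> values, Feature feature) {
    if (!values.containsKey(feature.name)) {
      return true; // unset
    }
    return Objects.equals(values.get(feature.name), feature.defaultValue);
  }
}
```

A custom rule would replace the body of isTransient, for example to also hide values that the grammar offers no place for.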

Unassigned Text

Unassigned Text consists of calls to data type rules or terminal rules which do not reside within an assignment. Example:

PluralRule:
  'contents:' count=INT Plural;
  
terminal Plural: 
  'item' | 'items';
  

Valid DSL-Scripts for this example are contents: 1 item or contents: 5 items. However, it is not stored in the semantic model whether the keyword item or items has been parsed. This is due to the fact that the rule call Plural is unassigned. The Parse Tree Constructor nevertheless has to decide which value to write during serialization. This decision can be made by implementing the IUnassignedTextSerializer.
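One plausible decision for the Plural example is to derive the text from the value of count. The sketch below shows only that decision in plain Java. PluralTextChooser and its methods are hypothetical and do not mirror the actual IUnassignedTextSerializer interface.

```java
// Hypothetical helper: chooses the text for the unassigned Plural rule call
// based on information that is available in the semantic model (count).
final class PluralTextChooser {

  // 'item' for exactly one element, 'items' otherwise.
  static String pluralFor(int count) {
    return count == 1 ? "item" : "items";
  }

  // Writes PluralRule as: 'contents:' count=INT Plural
  static String serialize(int count) {
    return "contents: " + count + " " + pluralFor(count);
  }
}
```

With this choice, a count of 1 is written as contents: 1 item and a count of 5 as contents: 5 items, regardless of which variant the original text used.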

Cross Reference Serializer

The Cross Reference Serializer specifies which values are to be written to the textual representation for cross references. This behavior can be customized by implementing ICrossReferenceSerializer. The default implementation delegates to ILinkingService, which may be the better place for customization.

Hidden Token Merger

After the Parse Tree Constructor has done its job to create a stream of tokens which are to be written to the textual representation, the Hidden Token Merger (IHiddenTokenMerger) mixes existing hidden tokens into this token stream. The default implementation uses the hidden tokens (whitespaces, linebreaks, comments) from the node model. The IHiddenTokenMerger is the factory for a Token Stream which is fed by the Parse Tree Constructor and which writes to another Token Stream.

Token Stream

The Parse Tree Constructor, the Hidden Token Merger and the Formatter use Token Streams for their output, and the latter two for their input as well. This makes them chainable. Token Streams can be converted to String using the TokenStringBuffer and to java.io.OutputStream using the TokenOutputStream. An implementation that reconstructs a node model may follow at some point in the future. While providing fast output due to the stream pattern, Token Streams allow easy manipulation of the stream, such as mixing in whitespaces or manipulating them.

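import java.io.IOException;
import org.eclipse.emf.ecore.EObject;

public interface ITokenStream {
  void close() throws IOException;
  // Writes a hidden token (whitespace, comment, etc.).
  void writeHidden(EObject grammarElement, String value) throws IOException;
  // Writes a semantic token created for the given grammar element.
  void writeSemantic(EObject grammarElement, String value) throws IOException;
}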
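
To illustrate how such streams can be chained, here is a self-contained sketch. It is not the Xtext implementation: SketchTokenStream is a simplified copy of the interface above that takes a plain Object instead of an EObject, and StringTokenStream and SpaceInsertingStream are hypothetical classes that only loosely correspond to a string buffer sink and a formatter-like stage.

```java
import java.io.IOException;

// Simplified stand-in for ITokenStream (Object instead of EObject) so that
// the sketch compiles without EMF on the classpath.
interface SketchTokenStream {
  void close() throws IOException;
  void writeHidden(Object grammarElement, String value) throws IOException;
  void writeSemantic(Object grammarElement, String value) throws IOException;
}

// Sink at the end of the chain: collects all tokens in a String.
final class StringTokenStream implements SketchTokenStream {
  private final StringBuilder buffer = new StringBuilder();

  public void close() {
    // nothing to release for an in-memory buffer
  }

  public void writeHidden(Object grammarElement, String value) {
    buffer.append(value);
  }

  public void writeSemantic(Object grammarElement, String value) {
    buffer.append(value);
  }

  @Override
  public String toString() {
    return buffer.toString();
  }
}

// Intermediate stage: forwards every token to the next stream and inserts a
// single space between two semantic tokens that have no hidden token between
// them.
final class SpaceInsertingStream implements SketchTokenStream {
  private final SketchTokenStream out;
  private boolean lastWasSemantic = false;

  SpaceInsertingStream(SketchTokenStream out) {
    this.out = out;
  }

  public void close() throws IOException {
    out.close();
  }

  public void writeHidden(Object grammarElement, String value) throws IOException {
    out.writeHidden(grammarElement, value);
    lastWasSemantic = false;
  }

  public void writeSemantic(Object grammarElement, String value) throws IOException {
    if (lastWasSemantic) {
      out.writeHidden(null, " ");
    }
    out.writeSemantic(grammarElement, value);
    lastWasSemantic = true;
  }
}
```

Because each stage both implements the interface and writes to another instance of it, further stages can be inserted between the producer and the sink without either of them having to change.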