SYIMP and SYUMP

$Revision: 1.1.1.1 $, $Date: 2000/10/12 02:15:32 $

Introduction

This document describes two closely related messaging protocols, meant to be used as the communication mechanism for some (we hope revolutionary) distributed internet technologies. The current working names of these protocols are:

SYIMP Simple Yet Illegible Messaging Protocol
SYUMP Simple Yet Understandable Messaging Protocol

They are basically meant to be the same protocol, but with one difference (can you guess the difference?). The first protocol is meant to be as efficient as possible and will thus be binary (and thus illegible). The second protocol is meant to have exactly the same functionality as the first but will use a human-readable (ASCII) representation, and will probably be a bit less efficient.
This difference is a result of not being able (yet) to decide which way is best to go. Both efficiency and human readability seem to be worthwhile properties of such a low-level protocol. Hence, I decided to describe both options as SYIMP and SYUMP, and let time tell which one will eventually be the SYAMPion. Of course it is also possible that both protocols will be comparable in popularity, and that with a proper abstraction layer one could generate and parse both.

Goals

The messaging protocols are designed to be the communication protocol for a new generation of distributed applications. In fact, in the future we hope that we might use various different types of protocols for this kind of application, but for starters we think a new specialized protocol will provide optimal flexibility and ease of use. The protocol should support:

extensive message passing, since the distributed application will be heavily asynchronous.
language independent
generic (make few assumptions about what is needed)
flexible/extendable (make it easy to add new properties or features)
simple, easy to understand

The overall design is simple yet powerful. All messages are sent over a TCP connection (in the future an SSL version might be defined as well, for security reasons). Each message consists of

some very generic housekeeping data (e.g. length of message)
tagged header fields
an application specific message body

The flexibility of the message lies in the use of tagged header fields. This makes it possible to use only those fields that are necessary for a certain type of message. Furthermore, it makes it easy to add new header fields if new features are to be supported. It is important to note the distinction between the header fields and the message body. Although the message body can be used to store the same information as the header fields, this is not recommended. The point of defining header fields in the protocol is to make it possible for the infrastructure (i.e. an implementation of the protocol) to handle this information appropriately, and not to burden the application with such protocol details.
For example, there is a timestamp tag, whose corresponding field contains the time the message was sent. This field is optional, since many applications will not need this information and can do without the extra overhead. Those applications that do need the information can configure the message sending code to automatically add this tag and fill the field. Alternatively, while debugging or troubleshooting network problems, a person might turn timestamping on for all messages.
The central theme in this design is the separation between messaging infrastructure and the application. The messaging infrastructure deals with the header fields, and passes the message body on to the application. The application deals with the message body, and instructs the messaging infrastructure what options to use. The exact border between application and infrastructure will need to be determined by trial and error.
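
The split described above could look roughly like this on the sending side. This is only a sketch: the Messenger class, its add_timestamp option and the injectable clock are hypothetical names, and because the Timestamp format is still t.b.d. the code simply assumes whole seconds since the Unix epoch.

```python
import time


class Messenger:
    """Hypothetical infrastructure object; the application only supplies the body."""

    def __init__(self, add_timestamp=False, clock=time.time):
        self.add_timestamp = add_timestamp  # option set by the application
        self.clock = clock                  # injectable, which helps testing

    def build(self, body: bytes):
        """Return (header_fields, body), ready to be serialized."""
        fields = {}
        if self.add_timestamp:
            # Assumed format: whole seconds since the Unix epoch.
            fields["Timestamp"] = int(self.clock())
        return fields, body


quiet = Messenger()
debugging = Messenger(add_timestamp=True)
print(quiet.build(b"HELLO WORLD"))
print(debugging.build(b"HELLO WORLD"))
```

The application never fills the Timestamp field itself; it only switches the option on or off.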

Tag-sets a.k.a. interfaces

In an attempt to make the messaging protocols a very generic basis for varying types of applications, a flexible interface definition mechanism is defined. In the header, before any application specific tags appear, one special tag appears first. This tag indicates the kind of header information that follows. Furthermore, one message can contain information from several different tag-sets. This makes it easy to build different layers of protocols (tag-sets) on top of existing tag-sets.
For example, a message starts with a very generic messaging tag-set tag, followed by its generic tags. Then a new tag-set tag follows, with the tags of a more application specific tag-set.
For now I have decided to define only a tag-set tag, and not a second tag-set-version tag. If a tag-set evolves over time, this can be indicated by using the version number as part of the tag-set name. Alternatively, the tag-set itself can define a version tag, to indicate the version of the tag-set.

SYIMP

SYIMP is the binary and efficient version of the above principles. Each message has the following layout:

32-bit total-message length in 8-bit bytes
32-bit message-body offset in 8-bit bytes (will be at least 8, to account for these first two length fields)
a varying number of tagged fields, each with the following lay-out:
1. 16-bit tag identifier
2. 16-bit field value length
3. bytes containing the value of the field, as indicated in the previous item.
The hexadecimal tag #0000 has a special meaning: it indicates the start of a new tag-set. The value of this tag is a variable number of bytes (preferably a multiple of 4), as indicated in the 16-bit value length. The value can be any series of bytes, but should preferably be restricted to alphanumeric characters and a few special characters such as the hyphen and the dot.
the message body, whose length is the total-message length minus the message-body offset.

Note: It is recommended that all tag-sets reserve the tag-id #0001 for a version tag.
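
The layout above can be sketched as a small encoder. The function names are made up for this illustration, and the byte order is an assumption: big-endian is used because that is how the example message further down writes its length words.

```python
import struct

TAGSET_TAG = 0x0000  # special tag that starts a new tag-set


def encode_field(tag_id: int, value: bytes) -> bytes:
    """Encode one tagged field: 16-bit tag, 16-bit value length, value bytes."""
    if len(value) > 0xFFFF:
        raise ValueError("field value does not fit a 16-bit length")
    return struct.pack(">HH", tag_id, len(value)) + value


def encode_syimp(fields, body: bytes) -> bytes:
    """Build a message from (tag_id, value) pairs and a body."""
    header = b"".join(encode_field(tag, value) for tag, value in fields)
    body_offset = 8 + len(header)  # 8 bytes for the two 32-bit length words
    total_length = body_offset + len(body)
    return struct.pack(">II", total_length, body_offset) + header + body


message = encode_syimp(
    [
        (TAGSET_TAG, b"MSG\x00"),                 # tag-set name, padded to 4 bytes
        (0x0011, struct.pack(">I", 0x12345678)),  # MessageId from the MSG tag-set
    ],
    b"HELLO WORLD",
)
print(message.hex())
```

Because the body offset is computed from the encoded header, a message without any tagged fields ends up with the minimum offset of 8.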

SYUMP

SYUMP is the ASCII cousin of SYIMP. A SYUMP message consists of the following parts:

The header with all fields. Each field has the following layout:
1. a symbolic tag name
2. a colon and a space
3. the value
4. a final carriage return (an optional linefeed will be ignored)
The special tag TAGSET defines a new tag-set.
a blank line
a line containing the length of the following body in decimal notation, ending with a colon and a space
the body as an array of bytes.

Note that the message body will start on the same line as the length, directly after the colon and space. This is done to prevent confusion around the CR and CR+LF options to start a new line. If the LF is optional then there is no easy way to determine if the first character after the CR is part of the message or the optional LF.
Note: It is recommended that all tag-sets reserve the tag-name "Version" for a version tag.
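
A minimal sketch of writing and reading this layout follows. It takes the list above literally (CR line ends, and a colon plus one space before the body), although the exact whitespace rules are still open, and the function names are invented for the example.

```python
def encode_syump(fields, body: bytes) -> bytes:
    """Build a message from (tag_name, value) string pairs and a body."""
    lines = []
    for name, value in fields:
        if "\r" in value or "\n" in value:
            raise ValueError("field values must not contain line breaks")
        lines.append(f"{name}: {value}\r")
    header = "".join(lines).encode("ascii")
    # Blank line, then the decimal body length, colon, space, and the body.
    return header + b"\r" + f"{len(body)}: ".encode("ascii") + body


def decode_syump(data: bytes):
    """Return ([(tag_name, value), ...], body)."""
    fields = []
    pos = 0
    while True:
        end = data.index(b"\r", pos)      # every header line ends with a CR
        line = data[pos:end].decode("ascii")
        pos = end + 1
        if data[pos:pos + 1] == b"\n":    # an optional LF is ignored
            pos += 1
        if line == "":
            break                         # blank line: end of the header
        name, sep, value = line.partition(": ")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        fields.append((name, value))
    colon = data.index(b": ", pos)
    digits = data[pos:colon]
    if not digits.isdigit():
        raise ValueError("body length is not a decimal number")
    start = colon + 2
    body = data[start:start + int(digits)]
    if len(body) != int(digits):
        raise ValueError("message is shorter than the announced body length")
    return fields, body


message = encode_syump([("Tagset", "MSG"), ("MessageId", "#12345678")], b"HELLO WORLD")
print(decode_syump(message))
```

Since the body is located by counting from the colon and space, a body whose first byte happens to be a linefeed is read back unchanged.
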
Open Questions:

Is the space after the colon useful for readability? For the tagged fields it might be optional, and there might be more than one space (before and after the colon), but for the message body this needs to be clear. The exact whitespace rules need some more work.
Will the message body be encoded as ASCII characters, or will binary data be allowed too? Probably we want at least 2 different encoding schemes: ASCII (with possible special escape sequences for non-printable characters) and BINARY with some kind of encoding (BASE-64, hexadecimal, or just raw bytes). The type will be signaled together with the message length field (and we need to decide what kind of length is used). This clearly needs some more work too.
Are tag names case sensitive?
Is it useful to optionally allow the SYIMP tag number between parentheses before the colon, e.g.
Timestamp (#1A):

The standard MSG Tag-set

Although anyone is free to define new tag-sets, we define a tag-set here that might be useful for any generic messaging protocol. The following table defines the message tags. It describes both the binary tag values and the ASCII tag names, and the format of both types of values.

tag-id	tag-name	format	comment
#0000	Tag-set	"MSG "	the standard MSG tag-set
#0001	Version	t.b.d. 32 bits	the version of the tag-set
#0010	Timestamp	t.b.d.	The time the message was sent
#0011	MessageId	32 bit integer	an id for the message, unique within the TCP channel
#0012	CorrelationId	32 bit integer	an id referring back to a message to which this is a reply

An example

In this example we first define a second tag-set, to illustrate the use of multiple tag-sets. Then we show a message using this tag-set, both as a SYIMP and as a SYUMP message.

an example tag-set

tag-id	tag-name	format	comment
#0000	Tag-set	"MOB "	The MOB (Messaging Object) tag-set
#0010	DestinationMob	32 bit integer	the id of the Mob the message is meant for; tag-set specific tags start numbering at #0010
#0011	ReplyToMob	32 bit integer	the id of the Mob to which a reply should be sent

Note: This tag-set uses some of the same tag-ids (at least the binary ones) as the standard MSG tag-set. This is perfectly legal, and is in fact the main motivation for adding a special tag-set tag. Tags can be considered to live in a separate namespace of the tag-set whose id or name most recently precedes them. This makes it easy not to worry about tag-ids and tag-names conflicting with other tag-sets. (But there is still a theoretical danger of conflicting tag-set names themselves.)
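
The namespace rule can be sketched as a lookup keyed on the current tag-set. The tag tables below are copied from the MSG and MOB tables in this document; the resolve function and the idea of falling back to a "#xxxx" name for unknown tags are assumptions made for the illustration.

```python
# Tag tables copied from the MSG and MOB tag-sets defined in this document.
TAGSETS = {
    "MSG": {0x0001: "Version", 0x0010: "Timestamp",
            0x0011: "MessageId", 0x0012: "CorrelationId"},
    "MOB": {0x0010: "DestinationMob", 0x0011: "ReplyToMob"},
}


def resolve(fields):
    """Turn a flat (tag_id, value) sequence into (tagset, tag_name, value) triples."""
    current = None
    resolved = []
    for tag_id, value in fields:
        if tag_id == 0x0000:
            current = value  # every following tag belongs to this tag-set
            continue
        if current is None:
            raise ValueError("tag seen before any tag-set tag")
        # Unknown tags keep a numeric placeholder name (an assumption).
        name = TAGSETS.get(current, {}).get(tag_id, f"#{tag_id:04X}")
        resolved.append((current, name, value))
    return resolved


fields = [
    (0x0000, "MSG"), (0x0011, 0x12345678),
    (0x0000, "MOB"), (0x0010, 0x1732), (0x0011, 0x3173),
]
for triple in resolve(fields):
    print(triple)
```

The tag-id #0011 appears twice in the input but resolves to MessageId in one case and ReplyToMob in the other, depending on the tag-set tag that came before it.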

An example SYIMP message

byte-number	byte1	byte2	byte3	byte4	comment
#0000	#00	#00	#00	#3B	total length (#0000003B)
#0004	#00	#00	#00	#30	body start position (#00000030)
#0008	#00	#00	#00	#04	tagset tag (#0000) with 4 byte value
#000C	#4D	#53	#47	#00	MSG (padded with a null character)
#0010	#00	#11	#00	#04	MessageId tag (#0011) with 4-byte value
#0014	#12	#34	#56	#78	MessageId value
#0018	#00	#00	#00	#04	tagset tag (#0000) with 4-byte value
#001C	#4D	#4F	#42	#00	MOB (padded with a null character)
#0020	#00	#10	#00	#04	DestinationMob tag (#0010) with 4 byte value
#0024	#00	#00	#17	#32	DestinationMob value (#00001732)
#0028	#00	#11	#00	#04	ReplyToMob tag (#0011) with 4 byte value
#002C	#00	#00	#31	#73	ReplyToMob value (#00003173)
#0030	#48	#45	#4C	#4C	BODY of message: "HELL"
#0034	#4F	#20	#57	#4F	BODY of message: "O WO"
#0038	#52	#4C	#44		BODY of message: "RLD"
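
Reading such a message back might look like the sketch below. The header bytes are taken from the table above, while the two length words are computed from the data instead of being copied. The decode_syimp name, the big-endian byte order and the stripping of padding from tag-set names are assumptions of this example.

```python
import struct


def decode_syimp(data: bytes):
    """Split a message into {tag-set name: {tag_id: value}} and the body."""
    total_length, body_offset = struct.unpack_from(">II", data, 0)
    if total_length != len(data) or not 8 <= body_offset <= total_length:
        raise ValueError("inconsistent length fields")
    tagsets = {}
    current = None
    pos = 8
    while pos < body_offset:
        tag_id, length = struct.unpack_from(">HH", data, pos)
        value = data[pos + 4:pos + 4 + length]
        pos += 4 + length
        if tag_id == 0x0000:
            # Strip padding (null or space) from the tag-set name.
            current = value.rstrip(b"\x00 ").decode("ascii")
            tagsets.setdefault(current, {})
        elif current is None:
            raise ValueError("field appears before any tag-set tag")
        else:
            tagsets[current][tag_id] = value
    if pos != body_offset:
        raise ValueError("header fields overrun the body offset")
    return tagsets, data[body_offset:total_length]


header = bytes.fromhex(
    "00000004 4D534700"  # tag-set tag, "MSG" plus null padding
    "00110004 12345678"  # MessageId
    "00000004 4D4F4200"  # tag-set tag, "MOB" plus null padding
    "00100004 00001732"  # DestinationMob
    "00110004 00003173"  # ReplyToMob
)
body = b"HELLO WORLD"
message = struct.pack(">II", 8 + len(header) + len(body), 8 + len(header)) + header + body
tagsets, payload = decode_syimp(message)
print(tagsets)
print(payload)
```

Grouping the fields in a dictionary per tag-set is a simplification: it drops the order of the fields and merges a tag-set that is opened more than once.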

An example SYUMP message

Here is the same message in SYUMP format:

Tagset: MSG
MessageId: #12345678
Tagset: MOB
DestinationMob: #00001732
ReplyToMob: #00003173

11:HELLO WORLD

Note: This message is 108 bytes long. The previous binary SYIMP message was only(?) 59 bytes. Furthermore, ASCII messages need more complex marshalling and parsing, especially of number values. Which format is more effective depends on the specific application, but having two functionally equivalent protocols might make it easier to switch between them.
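
The difference in marshalling effort for number values can be illustrated with a 32 bit integer. The helper names are invented, the binary side assumes big-endian byte order, and the ASCII side assumes that a value such as #00001732 is a "#" followed by exactly eight hexadecimal digits, which the value formats (still to be defined precisely) may or may not end up requiring.

```python
import string
import struct


def pack_binary(value: int) -> bytes:
    """Binary form: four bytes, most significant byte first."""
    return struct.pack(">I", value)


def unpack_binary(raw: bytes) -> int:
    (value,) = struct.unpack(">I", raw)
    return value


def format_ascii(value: int) -> str:
    """ASCII form: '#' followed by eight hexadecimal digits."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    return f"#{value:08X}"


def parse_ascii(text: str) -> int:
    digits = text[1:]
    if not text.startswith("#") or len(digits) != 8:
        raise ValueError(f"expected '#' and eight hex digits, got {text!r}")
    if any(c not in string.hexdigits for c in digits):
        raise ValueError(f"non-hexadecimal character in {text!r}")
    return int(digits, 16)


print(pack_binary(0x1732), unpack_binary(pack_binary(0x1732)))
print(format_ascii(0x1732), parse_ascii(format_ascii(0x1732)))
```

The binary pair is a fixed-size copy, whereas the ASCII pair has to validate the prefix, the length and every character before it can trust the value.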

TODO

This description is far from complete yet. Among the things to be done are:

An opening byte sequence to recognize the protocol.
A precise definition of each value format
A rethink of the exact syntax of the SYUMP headers etc. (see also the open questions in the SYUMP section)
A rethink of the standard MSG tag-set, and of whether we actually need to define such a standard tag-set (seems useful though)
Better names? (especially for SYIMP, SYUMP, tag-set)
Split the document into an explanatory part and an exact definition part?
Nicer layout of document, and especially of the tables.