Structured Data Formats

Mutual agreement

If the server decides to send e.g. an int, a byte and a null-terminated string, then the client must be able to read them in this form. Mutual agreement means that the client and the server must have agreed beforehand on the format of the messages. No information about the format is exchanged, because both client and server know what to expect.

This method is commonly used, e.g. by most Remote Procedure Call protocols. There is no overhead of redundant information: only the necessary data is sent.

Mutual agreement is not type-safe. If the server sends an int instead of a byte, then the client will just get the wrong data. It may even crash.
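
As a minimal sketch of mutual agreement, the Java class below writes an int, a byte and a null-terminated string, and reads them back in the same order. The class name, the Message holder and the choice of big-endian ints and ASCII text are assumptions made for illustration; they are part of the "agreement" and appear nowhere in the bytes themselves.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class MutualAgreementSketch {

    /** Holder for the three agreed fields (a hypothetical message layout). */
    static final class Message {
        final int number;
        final byte flag;
        final String text;

        Message(int number, byte flag, String text) {
            this.number = number;
            this.flag = flag;
            this.text = text;
        }
    }

    /** Sender side: int, then byte, then a null-terminated ASCII string. */
    static byte[] encode(int number, byte flag, String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(number);                                // 4 bytes, big-endian
            out.writeByte(flag);                                 // 1 byte
            out.write(text.getBytes(StandardCharsets.US_ASCII)); // string bytes
            out.writeByte(0);                                    // terminator
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Receiver side: must read the fields in exactly the agreed order. */
    static Message decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int number = in.readInt();
            byte flag = in.readByte();
            ByteArrayOutputStream text = new ByteArrayOutputStream();
            int b;
            // Stop at the zero terminator, or at end of input (-1).
            while ((b = in.read()) > 0) {
                text.write(b);
            }
            return new Message(number, flag,
                    new String(text.toByteArray(), StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Message m = decode(encode(42, (byte) 7, "hello"));
        System.out.println(m.number + " " + m.flag + " " + m.text);
    }
}
```

Nothing in the byte array says which field is which. A receiver that read the byte before the int would not be told of the mistake; it would simply compute different values from the same bytes.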

Safer messages include some sort of type information as part of the message.

Internet Mail Format

The Internet Mail Format is a character-based format that has been extended in many ways past its original use. For example, HTTP, the protocol underlying the Web, uses the same header format for its messages.

Header Format

A mail message consists of header information followed by the data body. The header may contain an open-ended amount of information (for mail, From, To, Subject, Date, Sender, CC, References, ...). This makes it useful for any other text-based protocol.

The header consists of an indefinite number of (logical) lines. The header is terminated by a blank line. Each line ends in CRLF (carriage return followed by line feed).

From: jan
To: you

// body starts after the blank line
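
A rough sketch of reading this format in Java follows. The class and method names are made up for illustration, and the sketch assumes every header fits on one physical line: it does not handle logical lines that are continued over several physical lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderParseSketch {

    /** Collects "Name: value" lines up to the first blank line. */
    static Map<String, String> parseHeaders(String message) {
        Map<String, String> headers = new LinkedHashMap<>();
        // Limit -1 keeps empty strings, so the blank separator line is seen.
        for (String line : message.split("\r\n", -1)) {
            if (line.isEmpty()) {
                break; // blank line: end of header
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Name: value" line; ignored in this sketch
            }
            headers.put(line.substring(0, colon).trim(),
                        line.substring(colon + 1).trim());
        }
        return headers;
    }

    /** Everything after the first blank line, or "" if there is none. */
    static String body(String message) {
        int end = message.indexOf("\r\n\r\n");
        return end < 0 ? "" : message.substring(end + 4);
    }

    public static void main(String[] args) {
        String message = "From: jan\r\nTo: you\r\n\r\nhello\r\n";
        System.out.println(parseHeaders(message));
        System.out.print(body(message));
    }
}
```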

MIME

MIME (Multipurpose Internet Mail Extensions) is designed for two purposes: firstly to allow messages composed of multiple parts (e.g. an archive of messages), and secondly to handle non-ASCII data.

Extra fields are added to the message header:

    Content-Type:  <toplevel-type/specific-type>
    Content-Transfer-Encoding: <encoding>
    Content-Length: <length>   // used by HTTP, not mail

Each message is terminated by a special string. HTTP uses this format, but adds a length field instead of a special string.

The standard toplevel types are application, audio, image, message, multipart, text, video. All non-standard types must begin with x-, e.g. x-compress. For each toplevel type there is a set of minor types, such as image/jpeg, image/gif. Non-standard minor types must also begin with x-, such as image/x-portable-bitmap.

The encoding tells how the body is encoded for transfer, e.g. 7bit, 8bit, binary, quoted-printable or base64. The encoding name is case-insensitive.

Example:

Content-Type: APPLICATION/zip
Content-Transfer-Encoding: BASE64
 
UEsDBBQAAAAIAFF8ASs1oDxOHAMAAM0GAAAIABUAbmV3Lmh0bWxVVAkAA+mU
ZzsW7Wc7VXgEAPQB9AGVVe9v2zYQ/a6/4qYNWwNMYhS3GObJwrqtxZoNQ9ak
8MeBFs8SZ/7QSCqu//udKMmx3QLL8sUO+e7d4927c9kGrSooW+SiSsogg8Lq
F+mDk5s+oIA7ZxvHtZamAWnglj9yyGDd8vCNhz9wX7IxJklKJc0OWofbVerD
QaFvEUNee5+CQ3V6mFIqjYGD4RpXKe9Da10KtTUBTVilt9wM3Jq7ur3A7vCw
t074E/QForbdwcmmDSeQr4fDH05p4RjVhtBl+E8vH1fpPYastnYn8SR4ZXCf
...
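
The body above can be turned back into bytes with the standard java.util.Base64 class. The sketch below decodes only the first line of the example; the class and method names are invented for illustration. The MIME decoder is used because it skips line separators, which a multi-line body will contain.

```java
import java.util.Base64;

class Base64Sketch {

    /** Decodes a base64 body; the MIME decoder ignores line breaks. */
    static byte[] decodeBody(String encoded) {
        return Base64.getMimeDecoder().decode(encoded);
    }

    /** True if the data starts with the zip signature 'P' 'K' 3 4. */
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4;
    }

    public static void main(String[] args) {
        // First line of the example body only.
        String firstLine =
            "UEsDBBQAAAAIAFF8ASs1oDxOHAMAAM0GAAAIABUAbmV3Lmh0bWxVVAkAA+mU";
        byte[] data = decodeBody(firstLine);
        System.out.println(data.length + " bytes, zip signature: " + looksLikeZip(data));
    }
}
```

The decoded bytes begin with the zip file signature, which agrees with the Content-Type of APPLICATION/zip in the header.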

Abstract Syntax Notation 1 (ASN.1)

ASN.1 is designed as "self-describing data". ASN.1 has a number of basic and derived data types:

INTEGER
IA5STRING (ASCII)
OCTET STRING
...
SEQUENCE
SET OF
...

A set of data is then given as a tagged sequence of data, e.g.

SEQUENCE INTEGER 10 IA5STRING "hello"

Then each tag is given a value as a byte, and the primitive values are encoded. The rules for this are totally obscure, so that even elementary tutorials are nearly impossible to read.

The result is a sequence of bytes that can be decoded by an ASN.1 parser into a set/sequence of data values of the correct types. This is a byte-format method.
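
To make the tag/value idea concrete, here is a hand-rolled sketch of encoding the example above. It assumes the Basic Encoding Rules in their simplest form (one tag byte, one length byte, then the value), which the text does not name explicitly, and it only handles values shorter than 128 bytes. The class and method names are invented; this is not a general ASN.1 encoder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class Asn1Sketch {
    // Universal tag bytes for the three types used in the example.
    static final int TAG_INTEGER = 0x02;
    static final int TAG_IA5STRING = 0x16;
    static final int TAG_SEQUENCE = 0x30;

    /** Tag, length, value. Only short-form lengths are handled. */
    static byte[] tlv(int tag, byte[] value) {
        if (value.length > 127) {
            throw new IllegalArgumentException("long-form lengths not handled in this sketch");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(tag);
        out.write(value.length);
        out.write(value, 0, value.length);
        return out.toByteArray();
    }

    /** Encodes SEQUENCE INTEGER 10 IA5STRING "hello". */
    static byte[] encodeExample() {
        byte[] ten = tlv(TAG_INTEGER, new byte[] {10});
        byte[] hello = tlv(TAG_IA5STRING, "hello".getBytes(StandardCharsets.US_ASCII));
        // The sequence's value is simply its encoded members, one after another.
        byte[] content = new byte[ten.length + hello.length];
        System.arraycopy(ten, 0, content, 0, ten.length);
        System.arraycopy(hello, 0, content, ten.length, hello.length);
        return tlv(TAG_SEQUENCE, content);
    }

    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeExample()));
        // prints: 30 0A 02 01 0A 16 05 68 65 6C 6C 6F
    }
}
```

Unlike mutual agreement, a receiver can walk these bytes without knowing the layout in advance: each tag byte says what type follows and each length byte says how far to read.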

XML

XML is the current favourite for representation of structured data. It is a character-format method, and simple XML documents are human-readable.

XML allows you to define your own tags as strings. Tags may be nested. Each begin tag has a matching end tag, except for empty tags. Tags may also possess attributes to give more information about the tag.

For example, a login message sent from client to server may be

    <login-request>
        <name>
            newmarch
        </name>
        <password>
            abcdefg
        </password>
    </login-request>

The reply may be one of

    <login-reply status="succeeded" />

    <login-reply status="failed" />

A document for directory replies could be

    <dir-reply>
        <file-count> 3 </file-count>
        <filename> abc.txt </filename>
        <filename> def.doc </filename>
        <filename> ghi.java </filename>
    </dir-reply>

(NB: </filename> cannot be a legal filename!)
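
A directory reply like the one above can be read with the DOM parser that ships with the JDK. This is only a sketch: the class and method names are made up, and the whitespace around each filename is trimmed on the assumption that it is layout rather than data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DirReplySketch {

    /** Returns the trimmed text of every filename element in the reply. */
    static List<String> filenames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("filename");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent().trim());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed dir-reply", e);
        }
    }

    public static void main(String[] args) {
        String reply =
            "<dir-reply>"
            + "<file-count> 3 </file-count>"
            + "<filename> abc.txt </filename>"
            + "<filename> def.doc </filename>"
            + "<filename> ghi.java </filename>"
            + "</dir-reply>";
        System.out.println(filenames(reply));
    }
}
```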

An XML tutorial is at http://www.w3schools.com/xml/default.asp

Document Type Definitions (DTDs)

An XML document can be given a formal specification. There are several methods, including:

XML DTD
XML Schema

The DTD method was the first and is still widely used. It is good at specifying document structure. It is not very good at specifying data types. XML Schema is good at specifying data types, but not so good at structure.

A DTD for login-requests is

<!ELEMENT login-request (name, password)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT password (#PCDATA)>

This says that a login-request must contain a name and a password, and that their values are both strings (#PCDATA).
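
The JDK's own parser can check a document against this DTD. In the sketch below the DTD is placed in a DOCTYPE declaration in front of the document, which is one way of attaching it; the class name and the isValid helper are assumptions for illustration. A validating parser reports DTD violations to an error handler, so the sketch installs one that records them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // The login-request DTD, wrapped in a DOCTYPE declaration.
    static final String DOCTYPE =
        "<!DOCTYPE login-request [\n"
        + "<!ELEMENT login-request (name, password)>\n"
        + "<!ELEMENT name (#PCDATA)>\n"
        + "<!ELEMENT password (#PCDATA)>\n"
        + "]>\n";

    /** True if the document body is well-formed and obeys the DTD. */
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            final boolean[] ok = {true};
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { ok[0] = false; }
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // not well-formed
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
            return ok[0];
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<login-request><name>newmarch</name>"
            + "<password>abcdefg</password></login-request>"));
        System.out.println(isValid(
            "<login-request><name>newmarch</name></login-request>"));
    }
}
```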

A DTD for directory replies is

<!ELEMENT dir-reply (file-count, filename+)>
<!ELEMENT file-count (#PCDATA)>
<!ELEMENT filename (#PCDATA)>

An XML DTD tutorial is at http://www.xmlfiles.com/dtd/

XML Schema

DTDs are not good at data types, since they only use #PCDATA, which means "any text". You can't talk about things like integers, dates or arrays. XML Schema attempts to fix this, but while it is good at data types, it is not so good at document structure.

A schema for login-requests is


<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

<xsd:element name="login-request" type="LoginRequest"/>

<xsd:complexType name="LoginRequest">
    <xsd:sequence>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="password" type="xsd:string"/>
    </xsd:sequence>
</xsd:complexType>

</xsd:schema>
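
A document can be checked against a schema with the javax.xml.validation package in the JDK. The sketch below carries its own copy of a login-request schema, with the root element named login-request so that it matches the earlier example document; that naming, the class name and the isValid helper are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaValidationSketch {
    // Self-contained copy of a login-request schema (element name assumed).
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">\n"
        + "<xsd:element name=\"login-request\" type=\"LoginRequest\"/>\n"
        + "<xsd:complexType name=\"LoginRequest\">\n"
        + "  <xsd:sequence>\n"
        + "    <xsd:element name=\"name\" type=\"xsd:string\"/>\n"
        + "    <xsd:element name=\"password\" type=\"xsd:string\"/>\n"
        + "  </xsd:sequence>\n"
        + "</xsd:complexType>\n"
        + "</xsd:schema>\n";

    /** True if the document is valid against the schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            // With no error handler set, validate() throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<login-request><name>newmarch</name>"
            + "<password>abcdefg</password></login-request>"));
        System.out.println(isValid(
            "<login-request><name>newmarch</name></login-request>"));
    }
}
```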

XML Schema data types

JAXB

There are a number of ways of handling XML in Java:

JAXP
JAXB
JDOM
JAXM
JAXR

We will use JAXB: a mechanism for

writing out Java objects as XML (marshalling)
creating Java objects from XML documents (unmarshalling)

JAXB "early release" used DTDs. JAXB release 1.0 switched to schema. JAXB 1.0 requires JDK 1.4.1 or later.

The JAXB compiler is xjc and is found in the jaxb/bin directory of Sun's Web services pack. It can be run as


    xjc.sh -p package_dir login.xsd

It generates a bunch of classes including the interface

It also generates an ObjectFactory, which includes factory methods. You need to use this factory to obtain the implementation of the interface: you are not supposed to access the implementation classes directly.

A test program is

This page is maintained by Jan Newmarch http://jan.newmarch.name