Structured Data Formats

Ogg-Vorbis format: 10 Mbytes; MP3 format: 10 Mbytes; WAV format: 115 Mbytes
Ogg-Vorbis format: 5 Mbytes; MP3 format: 5 Mbytes; WAV format: 55 Mbytes

Mutual agreement

If the server decides to send, e.g., an int, a byte and a null-terminated string, then the client must be able to read them in exactly this form. Mutual agreement means that the client and the server have agreed beforehand on the format of the messages. No information about the format is exchanged, because both client and server know what to expect.

This method is commonly used, e.g. by most Remote Procedure Call protocols. There is no overhead of redundant information: only the necessary data is sent.

Mutual agreement is not type-safe. If the server sends an int instead of a byte, then the client will just get the wrong data. It may even crash.
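
As a rough illustration, the following Java sketch encodes and decodes a message under mutual agreement. It assumes the agreed layout is a 4-byte big-endian int, one byte, and a null-terminated ASCII string, in that order; the class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class AgreedFormat {

    // "Server" side: write an int, a byte and a null-terminated string.
    static byte[] encode(int number, byte flag, String text) {
        byte[] chars = text.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(4 + 1 + chars.length + 1);
        buf.putInt(number);   // 4 bytes, big-endian by default
        buf.put(flag);        // 1 byte
        buf.put(chars);       // the string bytes
        buf.put((byte) 0);    // null terminator
        return buf.array();
    }

    // "Client" side: read the fields back in exactly the agreed order.
    // Nothing in the message says what the fields are.
    static String decode(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        int number = buf.getInt();
        byte flag = buf.get();
        StringBuilder text = new StringBuilder();
        byte b;
        while (buf.hasRemaining() && (b = buf.get()) != 0) {
            text.append((char) b);
        }
        return number + " " + flag + " " + text;
    }

    public static void main(String[] args) {
        byte[] message = encode(42, (byte) 7, "hello");
        System.out.println(message.length + " bytes: " + decode(message));
    }
}
```

If the client read the byte before the int, it would still get values back, just the wrong ones, which is the type-safety problem described above.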

Safer messages include some sort of type information as part of the message.

Internet Mail Format

The Internet Mail Format is a character-based format that has been extended in many ways beyond its original use. For example, HTTP, the protocol underlying the Web, uses the same header format.

Header Format

A mail message consists of header information followed by the data body. The header may contain an open-ended amount of information (for mail, From, To, Subject, Date, Sender, CC, References, ...). This makes it useful for any other text-based protocol.

The header consists of an indefinite number of (logical) lines. Each line ends in CR LF (carriage return, line feed). The header is terminated by a blank line.

From: jan
To: you

// body starts after the blank line
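
A minimal Java sketch of reading such a message: it splits the header lines at CR LF, stops at the blank line, and treats the rest as the body. It assumes each header fits on one physical line (folded continuation lines are not handled), and the class and method names are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderParser {

    // Collect "Name: value" pairs up to the first blank line.
    static Map<String, String> parseHeaders(String message) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : message.split("\r\n")) {
            if (line.isEmpty()) {
                break;                       // blank line terminates the header
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                    // skip lines that are not "Name: value"
            }
            headers.put(line.substring(0, colon).trim(),
                        line.substring(colon + 1).trim());
        }
        return headers;
    }

    // Everything after the blank line is the body.
    static String body(String message) {
        int end = message.indexOf("\r\n\r\n");
        return end < 0 ? "" : message.substring(end + 4);
    }

    public static void main(String[] args) {
        String message = "From: jan\r\nTo: you\r\n\r\nhello";
        System.out.println(parseHeaders(message));
        System.out.println(body(message));
    }
}
```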


MIME

MIME (Multipurpose Internet Mail Extensions) is designed for two purposes: firstly, to allow messages composed of multiple parts (e.g. an archive of messages), and secondly, to handle non-ASCII data.

Extra fields are added to the message header:

    Content-Type:  <toplevel-type/specific-type>
    Content-Transfer-Encoding: <encoding>
    Content-Length: <length>   // used by HTTP, not mail

Each message is terminated by a special string. HTTP uses this format, but adds a length field instead of a special string.

The standard toplevel types are application, audio, image, message, multipart, text, video. All non-standard types must begin with x-, e.g. x-compress. For each toplevel type there is a set of minor types, such as image/jpeg, image/gif. Non-standard minor types must also begin with x-, such as image/x-portable-bitmap.

The encoding field tells how the body is encoded for transfer, e.g. 7bit, base64 or quoted-printable. The encoding names are case-insensitive.


Content-Type: APPLICATION/zip
Content-Transfer-Encoding: BASE64
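
As a sketch of how binary data might be carried this way, the Java code below wraps a byte array in the two header fields shown above and a base64 body, then recovers the bytes again. It uses the MIME encoder from java.util.Base64; the class and method names are invented for the example, and real mail software does considerably more.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class MimeBody {

    // Build a header, a blank line, and the base64-encoded body.
    static String wrap(String contentType, byte[] data) {
        String encoded = Base64.getMimeEncoder().encodeToString(data);
        return "Content-Type: " + contentType + "\r\n"
             + "Content-Transfer-Encoding: base64\r\n"
             + "\r\n"
             + encoded + "\r\n";
    }

    // Skip the header and decode the body back to bytes.
    // The MIME decoder ignores the line breaks inside the body.
    static byte[] unwrap(String message) {
        String body = message.substring(message.indexOf("\r\n\r\n") + 4);
        return Base64.getMimeDecoder().decode(body);
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.US_ASCII);
        String message = wrap("application/octet-stream", data);
        System.out.print(message);
        System.out.println(new String(unwrap(message), StandardCharsets.US_ASCII));
    }
}
```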

Abstract Syntax Notation 1 (ASN.1)

ASN.1 is designed as "self describing data". ASN.1 has a number of basic and derived data types:

A set of data is then given as a tagged sequence of data e.g.


Then each tag is given a value as a byte, and the primitive values are encoded. The rules for this are totally obscure, so that even elementary tutorials are nearly impossible to read.

The result is a sequence of bytes that can be decoded by an ASN.1 parser into a set/sequence of data values of the correct types. This is a byte-format method.
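
To give a feel for the tag/length/value idea, here is a deliberately tiny Java sketch in the style of the DER encoding rules. It assumes only three types (INTEGER, UTF8String, SEQUENCE) and values shorter than 128 bytes, so that the length fits in one byte; it is not a general ASN.1 encoder, and the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

class TinyDer {
    // Universal tag bytes for the three types handled here.
    static final int INTEGER = 0x02;
    static final int UTF8_STRING = 0x0C;
    static final int SEQUENCE = 0x30;

    // tag byte, length byte, then the value bytes (short lengths only)
    static byte[] tlv(int tag, byte[] value) {
        if (value.length > 127) {
            throw new IllegalArgumentException("long-form lengths not handled");
        }
        byte[] out = new byte[2 + value.length];
        out[0] = (byte) tag;
        out[1] = (byte) value.length;
        System.arraycopy(value, 0, out, 2, value.length);
        return out;
    }

    static byte[] integer(long n) {
        // minimal big-endian two's complement form
        return tlv(INTEGER, BigInteger.valueOf(n).toByteArray());
    }

    static byte[] utf8(String s) {
        return tlv(UTF8_STRING, s.getBytes(StandardCharsets.UTF_8));
    }

    // A sequence is just a tagged wrapper around already-encoded items.
    static byte[] sequence(byte[]... items) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (byte[] item : items) {
            body.write(item, 0, item.length);
        }
        return tlv(SEQUENCE, body.toByteArray());
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: 30 07 02 01 05 0c 02 68 69
        System.out.println(hex(sequence(integer(5), utf8("hi"))));
    }
}
```

Because every value carries its own tag, a decoder can tell an integer from a string without prior agreement on the layout.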


XML

XML is the current favourite for representation of structured data. It is a character-format method, and simple XML documents are human-readable.

XML allows you to define your own tags as strings. Tags may be nested. Every begin tag has a matching end tag, except for empty tags. Tags may also possess attributes to give more information about the tag.

For example, a login message sent from client to server may be

The reply may be one of
    <login-reply status="succeeded" />
    <login-reply status="failed" />

A document for directory replies could be

    <dir-reply>
        <file-count> 3 </file-count>
        <filename> abc.txt </filename>
        <filename> def.doc </filename>
        <filename> </filename>
    </dir-reply>

(NB: </filename> cannot be a legal filename!)
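
A receiver could pull the filenames out of such a reply with the standard DOM parser that ships with Java. The sketch below assumes the reply is wrapped in a dir-reply element; the class and method names are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DirReplyReader {

    // Return the trimmed text of every <filename> element in the reply.
    static List<String> filenames(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("filename");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent().trim());
            }
            return names;
        } catch (Exception e) {
            // the parser rejects anything that is not well-formed XML
            throw new IllegalArgumentException("not a well-formed reply", e);
        }
    }

    public static void main(String[] args) {
        String reply = "<dir-reply>"
            + "<file-count> 2 </file-count>"
            + "<filename> abc.txt </filename>"
            + "<filename> def.doc </filename>"
            + "</dir-reply>";
        System.out.println(filenames(reply));
    }
}
```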

An XML tutorial is at

Document Type Definitions (DTDs)

An XML document can be given a formal specification. There are several methods of doing this, including DTDs and XML Schema.

The DTD method was the first and is still widely used. It is good at specifying document structure. It is not very good at specifying data types. XML Schema is good at specifying data types, but not so good at structure.

A DTD for login-requests is

<!ELEMENT login-request (name, password)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT password (#PCDATA)>

This says that a login-request must contain a name and a password, and that their values are both strings (#PCDATA).

A DTD for directory replies is

<!ELEMENT dir-reply (file-count, filename+)>
<!ELEMENT file-count (#PCDATA)>
<!ELEMENT filename (#PCDATA)>
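
One way to check a document against a DTD from Java is to switch on validation in the standard parser. The sketch below places the directory-reply DTD in an internal DOCTYPE declaration in front of the document; that placement, and the class and method names, are assumptions made to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdCheck {
    // The directory-reply DTD as an internal DOCTYPE subset.
    static final String DTD =
          "<!DOCTYPE dir-reply [\n"
        + "<!ELEMENT dir-reply (file-count, filename+)>\n"
        + "<!ELEMENT file-count (#PCDATA)>\n"
        + "<!ELEMENT filename (#PCDATA)>\n"
        + "]>\n";

    // True if the document parses and satisfies the DTD.
    // Any failure (including parser set-up problems) counts as invalid.
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);     // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<dir-reply><file-count>1</file-count><filename>abc.txt</filename></dir-reply>"));
        // filename+ requires at least one filename, so this one fails
        System.out.println(isValid(
            "<dir-reply><file-count>0</file-count></dir-reply>"));
    }
}
```

Note that the DTD accepts any text as a file-count; it has no way of saying that the value must be a number.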

An XML DTD tutorial is at

XML Schema

DTDs are not good at data types, since they only use #PCDATA, which means "any text". You can't talk about things like integers, dates or arrays. XML Schema attempts to fix this, but while it is good at data types, it is not so good at document structure.

A schema for login-requests is

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

<xsd:element name="loginrequest" type="LoginRequest"/>

<xsd:complexType name="LoginRequest">
    <xsd:sequence>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="password" type="xsd:string"/>
    </xsd:sequence>
</xsd:complexType>

</xsd:schema>
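
A document can be checked against a schema using the javax.xml.validation package in the standard Java library. The sketch below embeds a completed form of the login schema as a string, assuming the standard XML Schema namespace and an xsd:sequence around the two elements; the class and method names are illustrative.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

class SchemaCheck {
    // Login schema, completed so that it is well-formed (an assumption).
    static final String XSD =
          "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xsd:element name=\"loginrequest\" type=\"LoginRequest\"/>"
        + "<xsd:complexType name=\"LoginRequest\">"
        + "<xsd:sequence>"
        + "<xsd:element name=\"name\" type=\"xsd:string\"/>"
        + "<xsd:element name=\"password\" type=\"xsd:string\"/>"
        + "</xsd:sequence>"
        + "</xsd:complexType>"
        + "</xsd:schema>";

    // True if the document satisfies the schema; validate() throws otherwise.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<loginrequest><name>jan</name><password>secret</password></loginrequest>"));
        // the password element is missing, so this one fails
        System.out.println(isValid("<loginrequest><name>jan</name></loginrequest>"));
    }
}
```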


XML Schema data types


There are a number of ways of handling XML parsing in Java

We will use JAXB: a mechanism for

JAXB "early release" used DTDs. JAXB release 1.0 switched to schema. JAXB 1.0 requires JDK 1.4.1 or later.

The JAXB compiler is xjc and is found in the jaxb/bin directory of Sun's Web services pack. It can be run as

    xjc -p package_dir login.xsd

It generates a number of classes, including the interface

public interface LoginRequest {

    java.lang.String getPassword();
    void setPassword(java.lang.String value);
    java.lang.String getName();
    void setName(java.lang.String value);
}

It also generates an ObjectFactory which includes methods

public class ObjectFactory
    extends impl.runtime.DefaultJAXBContextImpl {

    public ObjectFactory() {
        // ...
    }

    public java.lang.Object newInstance(java.lang.Class javaContentInterface)
        throws javax.xml.bind.JAXBException {
        // ...
    }

    public LoginRequest createLoginRequest()
        throws javax.xml.bind.JAXBException {
        // ...
    }
}

You need to use this factory to obtain instances of the interface; you are not supposed to access the implementation classes directly.

A test program is


import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;

import generated.LoginRequest;

public class TestLoginRequest {

    public static void main(String[] args) {
        new TestLoginRequest();
    }

    public TestLoginRequest() {

        // Open the XML document to be unmarshalled
        InputStream in = null;
        try {
            in = new FileInputStream(new File("login.xml"));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }

        LoginRequest request = null;
        try {
            // "generated" is the package name used in the -p option to xjc
            JAXBContext jc = JAXBContext.newInstance("generated");

            Unmarshaller u = jc.createUnmarshaller();
            JAXBElement<?> obj = (JAXBElement<?>) u.unmarshal(in);
            System.out.println("Type: " + obj.getClass().toString());

            // The element wraps the actual LoginRequest value
            Object requestObj = obj.getValue();
            request = (LoginRequest) requestObj;
            System.out.println("Type: " + requestObj.getClass().toString());
        } catch (JAXBException e) {
            e.printStackTrace();
            System.exit(2);
        }

        System.out.println("Name is " + request.getName());
        System.out.println("Password is " + request.getPassword());
    }
} // TestLoginRequest



This page is maintained by Jan Newmarch
Copyright © Jan Newmarch, Monash University, 2007
This work is licensed under a Creative Commons License.
The moral right of Jan Newmarch to be identified as the author of this page has been asserted.