XML

Introduction

XML is now a widespread way of representing complex data structures serialised into text format. It is used to describe documents such as DocBook and XHTML. It is used in specialised markup languages such as MathML and CML (Chemistry Markup Language). It is used to encode data as SOAP messages for Web Services, and the Web Service can be specified using WSDL (Web Services Description Language).

At the simplest level, XML allows you to define your own tags for use in text documents. Tags can be nested and can be interspersed with text. Each tag can also contain attributes with values. For example,

<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">
    jan@newmarch.name
  </email>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>
    

The structure of an XML document can be described in a number of ways, for example by a DTD (Document Type Definition), by XML Schema or by RELAX NG.

There is argument over the relative value of each way of defining the structure of an XML document. We won't buy into that, as Go does not support any of them. Go cannot check the validity of a document against a schema; it can only check that the document is well-formed.

Four topics are discussed in this chapter: parsing an XML stream, unmarshalling XML into Go data, marshalling Go data into XML, and XHTML.

Parsing XML

Go has an XML parser which is created using NewParser. This takes an io.Reader as parameter and returns a pointer to Parser. The main method of this type is Token which returns the next token in the input stream. The token is one of the types StartElement, EndElement, CharData, Comment, ProcInst or Directive.

The types are

StartElement
The type StartElement is a structure with two fields:
type StartElement struct {
    Name Name
    Attr []Attr
}

type Name struct {
    Space, Local string
}

type Attr struct {
    Name  Name
    Value string
}
	
EndElement
This is also a structure
type EndElement struct {
    Name Name
}
	
CharData
This type represents the text content enclosed by a tag and is a simple type
type CharData []byte
	
Comment
Similarly for this type
type Comment []byte
	
ProcInst
A ProcInst represents an XML processing instruction of the form <?target inst?>
type ProcInst struct {
    Target string
    Inst   []byte
}
	
Directive
A Directive represents an XML directive of the form <!text>. The bytes do not include the <! and > markers.
type Directive []byte
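
As an illustration of the Name and Attr fields of StartElement, the following is a minimal sketch that lists every attribute in a document. It assumes a current Go release, where the package is imported as encoding/xml and the parser is created with xml.NewDecoder rather than NewParser; the function name listAttrs is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// listAttrs (hypothetical helper) returns one "element@attribute=value"
// string for every attribute found on a start tag in doc.
// Assumption: current Go, so xml.NewDecoder is used instead of NewParser.
func listAttrs(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var out []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out, nil // end of input
		}
		if err != nil {
			return nil, err // the document is not well-formed
		}
		if start, ok := tok.(xml.StartElement); ok {
			for _, a := range start.Attr {
				out = append(out, start.Name.Local+"@"+a.Name.Local+"="+a.Value)
			}
		}
	}
}

func main() {
	attrs, err := listAttrs(`<person><email type="work">j.newmarch@boxhill.edu.au</email></person>`)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	for _, a := range attrs {
		fmt.Println(a) // prints email@type=work
	}
}
```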
	

A program to print out the tree structure of an XML document is


/* Parse XML
*/

package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"strings"
	"xml"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Println("Usage: ", os.Args[0], "file")
		os.Exit(1)
	}
	file := os.Args[1]
	bytes, err := ioutil.ReadFile(file)
	checkError(err)
	r := strings.NewReader(string(bytes))

	parser := xml.NewParser(r)
	depth := 0
	for {
		token, err := parser.Token()
		if err != nil {
			// end of input, or the document is not well-formed
			break
		}
		switch t := token.(type) {
		case xml.StartElement:
			printElmt(t.Name.Local, depth)
			depth++
		case xml.EndElement:
			depth--
			printElmt(t.Name.Local, depth)
		case xml.CharData:
			printElmt("\""+string(t)+"\"", depth)
		case xml.Comment:
			printElmt("Comment", depth)
		case xml.ProcInst:
			printElmt("ProcInst", depth)
		case xml.Directive:
			printElmt("Directive", depth)
		default:
			fmt.Println("Unknown")
		}
	}
}

func printElmt(s string, depth int) {
	for n := 0; n < depth; n++ {
		fmt.Print("  ")
	}
	fmt.Println(s)
}

func checkError(err os.Error) {
	if err != nil {
		fmt.Println("Fatal error ", err.String())
		os.Exit(1)
	}
}
Note that the parser includes all CharData, including the whitespace between tags.
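
Whitespace-only CharData is usually noise when printing a tree. The sketch below shows one way to drop it. It assumes a current Go release (encoding/xml, xml.NewDecoder and io.EOF instead of the older names used in the listing); the function name outline is hypothetical.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// outline (hypothetical helper) returns one indented line per start tag
// and per non-blank run of character data.
// Assumption: current Go, so xml.NewDecoder is used instead of NewParser.
func outline(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var lines []string
	depth := 0
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return lines, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			lines = append(lines, strings.Repeat("  ", depth)+t.Name.Local)
			depth++
		case xml.EndElement:
			depth--
		case xml.CharData:
			// Skip the whitespace that only separates tags.
			text := bytes.TrimSpace(t)
			if len(text) > 0 {
				lines = append(lines, strings.Repeat("  ", depth)+`"`+string(text)+`"`)
			}
		}
	}
}

func main() {
	lines, err := outline("<name>\n  <family> Newmarch </family>\n</name>")
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Trimming is a choice made for display only; a program that must preserve mixed content exactly should keep the CharData as delivered.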

There is a potential trap in using this parser. It re-uses space for strings, so that once you see a token you need to copy its value if you want to refer to it later. Go has methods such as func (c CharData) Copy() CharData to make a copy of data.
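
The copying step can be sketched as follows. This assumes a current Go release, where the decoder is created with xml.NewDecoder and CharData still has a Copy method; the function name collectText is hypothetical.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// collectText (hypothetical helper) keeps every non-blank CharData token
// for later use. Each token is copied before the next call to Token,
// because the decoder may reuse the underlying bytes.
func collectText(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var saved []xml.CharData
	for {
		tok, err := d.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if cd, ok := tok.(xml.CharData); ok && len(bytes.TrimSpace(cd)) > 0 {
			saved = append(saved, cd.Copy()) // Copy detaches the bytes from the decoder
		}
	}
	var out []string
	for _, cd := range saved {
		out = append(out, string(cd))
	}
	return out, nil
}

func main() {
	texts, err := collectText(`<name><family>Newmarch</family><personal>Jan</personal></name>`)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	fmt.Println(texts)
}
```

Converting a CharData to a string also makes a copy, so the explicit Copy matters mainly when the data is kept as bytes, as it is here until the end of the loop.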

Unmarshalling XML

Go provides a function Unmarshal and a method func (*Parser) Unmarshal to unmarshal XML into Go data structures. The unmarshalling is not perfect: Go and XML are different languages.

We consider a simple example before looking at the details. We take the XML document given earlier:

<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">
    jan@newmarch.name
  </email>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>
    

This can map onto the Go structures

type Person struct {
	Name Name
	Email []Email
}

type Name struct {
	Family string
	Personal string
}

type Email struct {
	Type string "attr"
	Address string "chardata"
}
    
This requires several comments:
  1. Unmarshalling uses the Go reflection package. This requires that all fields be public, i.e. start with a capital letter. Go will use case-insensitive matching to match names such as the XML tag "name" to the field Name
  2. Repeated tags in the XML map to a slice in Go
  3. Attributes within tags will match fields in a structure only if the Go field has the tag "attr". This occurs with the field Type of Email, matching the attribute "type" of the "email" tag
  4. If an XML tag has no attributes and only has character data, then it matches a string field by the same name (case-insensitive, though). So the tag "family" with character data "Newmarch" maps to the string field Family
  5. But if the tag has attributes, then it must map to a structure. Go assigns the character data to the field with tag chardata. This occurs with the "email" data and the field Address with tag chardata

A program to unmarshal the document above is


/* Unmarshal
*/

package main

import ("fmt"; "xml"; "strings"; "os")

type Person struct {
	Name Name
	Email []Email
}

type Name struct {
	Family string
	Personal string
}

type Email struct {
	Type string "attr"
	Address string "chardata"
}
	
func main() {
	str := `
<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">
    jan@newmarch.name
  </email>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>`

	r := strings.NewReader(str)

	var person Person
	err := xml.Unmarshal(r, &person)
	checkError(err)

	// now use the person structure e.g.
	fmt.Println("\"" + person.Name.Family + "\"")
	fmt.Println("\"" + person.Email[1].Address + "\"")
}

func checkError(err os.Error) {
	if err != nil {
		fmt.Println("Fatal error ", err.String())
		os.Exit(1)
	}
}
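
For readers on a current Go release, the same mapping can be sketched with the encoding/xml package. The assumptions here are that Unmarshal takes a byte slice rather than a reader, and that field tags are written in the `xml:"..."` form (for example `xml:"type,attr"` and `xml:",chardata"`); the function name parsePerson is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Struct tags use the current encoding/xml syntax (an assumption about
// the Go release in use, not the syntax shown in the listing above).
type Person struct {
	Name  Name    `xml:"name"`
	Email []Email `xml:"email"` // repeated <email> tags fill the slice
}

type Name struct {
	Family   string `xml:"family"`
	Personal string `xml:"personal"`
}

type Email struct {
	Type    string `xml:"type,attr"` // taken from the type="..." attribute
	Address string `xml:",chardata"` // the text inside <email>...</email>
}

// parsePerson (hypothetical helper) unmarshals a <person> document.
func parsePerson(doc string) (Person, error) {
	var p Person
	err := xml.Unmarshal([]byte(doc), &p)
	return p, err
}

func main() {
	p, err := parsePerson(`
<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>`)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	// Character data keeps its surrounding whitespace, so trim for display.
	fmt.Println(strings.TrimSpace(p.Name.Family))
	fmt.Println(p.Email[0].Type, strings.TrimSpace(p.Email[0].Address))
}
```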


The strict rules are [LATER]

Marshalling XML

At present there is no support for marshalling a Go data structure into XML. In this section we present a simple marshalling function that will give a basic serialisation. The result can be unmarshalled using the Go function Unmarshal of the previous section.

A straightforward but naive approach would be to write code that walks over your data structures, printing out results as it goes. But if it is customised to your data types, then you will need to change the code each time the types change.

A better approach, and one that is used by the Go serialisation libraries, is to use the reflection package. This is a package that allows you to examine data types and data structures from within a running program. The idea of reflection has been present in artificial intelligence programming for many years, but is still seen as a rather arcane technique for mainstream languages.

Go has two principal reflection types: reflect.Type gives information about the Go types, while reflect.Value gives information about a particular data value. Value has a method Type() that can return the type.

The simplest types and values correspond to primitive types. For example, there is IntType, BoolType etc, which can be used as values in type switches to determine the precise type of a Type. The corresponding value types are IntValue and BoolValue with methods such as Get to return the value.

A StructType is more complex, as it has methods to access the fields by

func (t *StructType) Field(i int) (f StructField)
    
and a StructField has fields such as Name, which holds the field's label as a string. This is useful for examining the type structure.

A StructValue is useful for examining the value of fields of a data value. It has a method

func (v *StructValue) Field(i int) Value
    
which can be used to extract the value of each field.
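
The same walk over fields can be sketched against the reflect package of a current Go release. The assumptions are that reflect.ValueOf and reflect.TypeOf replace NewValue and Typeof, and that the kind of a type is tested with Kind() instead of a type switch on *StructType and friends; the function name describe is hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

type Name struct {
	Family   string
	Personal string
}

type Email struct {
	Kind    string
	Address string
}

type Person struct {
	Name  Name
	Email []Email
}

// describe (hypothetical helper) returns "FieldName kind" for each field
// of a struct value, or nil if x is not a struct.
// Assumption: current reflect API (ValueOf, Kind), not the API in the text.
func describe(x interface{}) []string {
	v := reflect.ValueOf(x)
	t := v.Type()
	if t.Kind() != reflect.Struct {
		return nil
	}
	var out []string
	for i := 0; i < t.NumField(); i++ {
		// t.Field(i) describes the field; v.Field(i) holds its value.
		out = append(out, t.Field(i).Name+" "+v.Field(i).Kind().String())
	}
	return out
}

func main() {
	for _, line := range describe(Person{}) {
		fmt.Println(line)
	}
}
```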

The reflection process is basically started by calling NewValue on a data object, and then examining its type and recursively walking through the values. What we do with each value is to surround it by tags, made of field names of the structures encountered.

There are two complexities. The first is that the initial data value will typically be a structure, and this doesn't have a field name as it is not itself part of a structure. For this starting case, we use the type name of the structure as the XML tag name.

The second complexity comes with arrays or slices. In this case we need to work through each element of the array/slice, each time repeating the field name from the enclosing structure.

We define three functions. The first, Marshal, takes an initial data value, prepares the XML document and creates the top-level tag from the structure's type name. The second function recurses through the values, switching on data types and writing tags from field names and values as XML character data. The third function handles the special case of slices, as the tag name needs to be repeated for every element of the slice.

We ignore pointers, channels, etc. We also do not produce attributes, just tags and character data. The program is


/* Marshal
*/

package main

import ("fmt"; "io"; "os"; "reflect"; "bytes")

type Person struct {
	Name Name
	Email []Email
}

type Name struct {
	Family string
	Personal string
}

type Email struct {
	Kind string "attr"
	Address string "chardata"
}

func main() {
	person := Person{
		Name: Name{Family: "Newmarch", Personal: "Jan"},
		Email: []Email{Email{Kind: "home", Address: "jan"},
			Email{Kind: "work", Address: "jan"}}}

	buff := bytes.NewBuffer(nil)
	Marshal(person, buff)
	fmt.Println(buff.String())
}

func Marshal(e interface{}, w io.Writer) {
	// make it a legal XML document
	w.Write([]byte("<?xml version=\"1.1\" encoding=\"UTF-8\" ?>\n"))

	// top-level e is a value and has no structure field,
	// so use its type name as the tag
	typ := reflect.Typeof(e)
	name := typ.Name()

	startTag(name, w)
	MarshalValue(reflect.NewValue(e), w)
	endTag(name, w)
}

func MarshalValue(v reflect.Value, w io.Writer) {
	t := v.Type()
	switch t := t.(type) {
	case *reflect.StructType:
		for n:= 0; n < t.NumField(); n++ {
			field := t.Field(n)

			vv := v.(*reflect.StructValue)

			// special case if it is a slice
			_, ok := vv.Field(n).Type().(*reflect.SliceType)
			if ok {
				// slice
				MarshalSliceValue(field.Name, 
					vv.Field(n).(*reflect.SliceValue), w)
			} else {
				// not a slice
				startTag(field.Name, w)
				MarshalValue(vv.Field(n), w)
				endTag(field.Name, w)
			}
		} 
	case *reflect.IntType, *reflect.UintType:
	case *reflect.BoolType:
	case *reflect.StringType:
		vv := v.(*reflect.StringValue)
		w.Write([]byte("   "+vv.Get()+"\n"))
	default:
	}
}

func MarshalSliceValue(tag string, v *reflect.SliceValue, w io.Writer) {
	for n := 0; n < v.Len(); n++ {
		startTag(tag, w)
		MarshalValue(v.Elem(n), w)
		endTag(tag, w)
	}
}

func startTag(s string, w io.Writer) {
	w.Write([]byte("<" + s + ">\n"))
}

func endTag(s string, w io.Writer) {
	w.Write([]byte("</" + s + ">\n"))
}

func checkError(err os.Error) {
	if err != nil {
		fmt.Println("Fatal error ", err.String())
		os.Exit(1)
	}
}
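
The listing above depends on the reflection API of the Go release it was written for. As a rough sketch of the same idea on a current release, the version below uses reflect.ValueOf and Kind(), folds the slice case into the main function, and reads its own output back with xml.Unmarshal as a check. It makes several assumptions: a current Go release is in use, the XML declaration says version 1.0, and strings are escaped with xml.EscapeText. The helper names marshalValue and roundTrip are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"reflect"
)

type Person struct {
	Name  Name
	Email []Email
}

type Name struct{ Family, Personal string }

type Email struct{ Kind, Address string }

// Marshal writes e as XML, using the type name as the top-level tag.
func Marshal(e interface{}, w io.Writer) {
	io.WriteString(w, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
	v := reflect.ValueOf(e)
	name := v.Type().Name()
	startTag(name, w)
	marshalValue(v, w)
	endTag(name, w)
}

// marshalValue handles structs, slices inside structs, and strings.
// Other kinds (ints, pointers, channels, ...) are ignored in this sketch.
func marshalValue(v reflect.Value, w io.Writer) {
	switch v.Kind() {
	case reflect.Struct:
		t := v.Type()
		for n := 0; n < v.NumField(); n++ {
			name, f := t.Field(n).Name, v.Field(n)
			if f.Kind() == reflect.Slice {
				// repeat the field name as the tag for every element
				for i := 0; i < f.Len(); i++ {
					startTag(name, w)
					marshalValue(f.Index(i), w)
					endTag(name, w)
				}
				continue
			}
			startTag(name, w)
			marshalValue(f, w)
			endTag(name, w)
		}
	case reflect.String:
		xml.EscapeText(w, []byte(v.String())) // escapes &, < and >
	}
}

func startTag(s string, w io.Writer) { io.WriteString(w, "<"+s+">") }

func endTag(s string, w io.Writer) { io.WriteString(w, "</"+s+">\n") }

// roundTrip marshals p and unmarshals the result again.
func roundTrip(p Person) (Person, string, error) {
	var buf bytes.Buffer
	Marshal(p, &buf)
	var back Person
	err := xml.Unmarshal(buf.Bytes(), &back)
	return back, buf.String(), err
}

func main() {
	person := Person{
		Name:  Name{Family: "Newmarch", Personal: "Jan"},
		Email: []Email{{Kind: "home", Address: "jan"}, {Kind: "work", Address: "jan"}},
	}
	back, text, err := roundTrip(person)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	fmt.Print(text)
	fmt.Println(back.Name.Family, len(back.Email))
}
```

No newline is written after a start tag, so string values come back without added whitespace.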


XHTML

HTML does not conform to XML syntax. It has unterminated tags such as '<br>'. XHTML is a cleanup of HTML to make it compliant with XML. Documents in XHTML can be managed using the techniques above for XML.
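
One practical wrinkle is that XHTML documents often use named character entities such as &nbsp;, which plain XML does not define. The sketch below assumes a current Go release, where the encoding/xml decoder has an Entity field and the package supplies the xml.HTMLEntity table; the function name textOf is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// textOf (hypothetical helper) concatenates all character data in an
// XHTML fragment, resolving HTML entity names such as &nbsp;.
// Assumption: current Go, with Decoder.Entity and xml.HTMLEntity.
func textOf(doc string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.Entity = xml.HTMLEntity // map from entity name to replacement text
	var sb strings.Builder
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return sb.String(), nil
		}
		if err != nil {
			return "", err
		}
		if cd, ok := tok.(xml.CharData); ok {
			sb.Write(cd)
		}
	}
}

func main() {
	text, err := textOf(`<p>Jan&nbsp;Newmarch<br/></p>`)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	fmt.Printf("%q\n", text)
}
```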