XML is now a widespread way of representing complex data structures serialised into text format. It is used to describe documents such as DocBook and XHTML. It is used in specialised markup languages such as MathML and CML (Chemistry Markup Language). It is used to encode data as SOAP messages for Web Services, and the Web Service can be specified using WSDL (Web Services Description Language).
At the simplest level, XML allows you to define your own tags
for use in text documents. Tags can be nested and can be
interspersed with text. Each tag can also contain attributes
with values. For example,
<person>
    <name>
        <family> Newmarch </family>
        <personal> Jan </personal>
    </name>
    <email type="personal">
        jan@newmarch.name
    </email>
    <email type="work">
        j.newmarch@boxhill.edu.au
    </email>
</person>
The structure of any XML document can be described in a number of ways, such as a Document Type Definition (DTD), an XML Schema, or RELAX NG.
There is argument over the relative value of each way of defining the structure of an XML document. We won't buy into that, as Go does not support any of them. Go cannot check the validity of a document against a schema, but only its well-formedness.
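The well-formedness check can be sketched by reading tokens until the input ends. This is a minimal sketch, assuming a current Go toolchain, where the package is encoding/xml and the parser type is called Decoder; the helper name wellFormed is made up for illustration. It reports the syntax errors the decoder detects and is not a complete conformance check.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wellFormed (a hypothetical helper) reads every token in doc and
// returns the first syntax error the decoder reports, or nil if the
// end of the input was reached. No DTD or schema is consulted.
func wellFormed(doc string) error {
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		_, err := d.Token()
		if err == io.EOF {
			return nil // reached the end without a syntax error
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	docs := []string{
		"<person><name>Jan</name></person>", // tags nest correctly
		"<person><name>Jan</person></name>", // tags overlap
	}
	for _, doc := range docs {
		fmt.Printf("%s -> %v\n", doc, wellFormed(doc))
	}
}
```

A nil result only says that the tags and syntax were acceptable to the decoder; it says nothing about whether the document uses the right tags.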
Four topics are discussed in this chapter: parsing an XML stream, unmarshalling XML into Go data, marshalling Go data into XML, and XHTML.
Go has an XML parser which is created using NewParser. This takes an io.Reader as parameter and returns a pointer to Parser. The main method of this type is Token, which returns the next token in the input stream. The token is one of the types StartElement, EndElement, CharData, Comment, ProcInst or Directive.
The types are
StartElement
StartElement is a structure with two fields, Name and Attr:
type StartElement struct {
    Name Name
    Attr []Attr
}

type Name struct {
    Space, Local string
}

type Attr struct {
    Name  Name
    Value string
}
EndElement
type EndElement struct {
    Name Name
}
CharData
type CharData []byte
Comment
type Comment []byte
ProcInst
type ProcInst struct {
    Target string
    Inst   []byte
}
Directive
type Directive []byte
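The fields of these token types can be read directly once a type assertion has identified the token. The following is a minimal sketch, assuming a current Go toolchain (package encoding/xml, xml.NewDecoder in place of NewParser); the helper name attributes and its output format are made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// attributes (a hypothetical helper) lists every attribute in doc as
// "element.attribute=value", using the Name and Attr fields of
// StartElement.
func attributes(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var out []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		// Only start tags carry attributes
		if start, ok := tok.(xml.StartElement); ok {
			for _, a := range start.Attr {
				out = append(out, start.Name.Local+"."+a.Name.Local+"="+a.Value)
			}
		}
	}
}

func main() {
	doc := `<person><email type="personal">jan@newmarch.name</email></person>`
	attrs, err := attributes(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, a := range attrs {
		fmt.Println(a)
	}
}
```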
A program to print out the tree structure of an XML document is
/* Parse XML
 */
package main

import (
    "fmt"
    "io/ioutil"
    "os"
    "strings"
    "xml"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: ", os.Args[0], "file")
        os.Exit(1)
    }
    file := os.Args[1]
    bytes, err := ioutil.ReadFile(file)
    checkError(err)
    r := strings.NewReader(string(bytes))

    parser := xml.NewParser(r)
    depth := 0
    for {
        // Token returns an error at the end of the input
        token, err := parser.Token()
        if err != nil {
            break
        }
        switch t := token.(type) {
        case xml.StartElement:
            elmt := xml.StartElement(t)
            name := elmt.Name.Local
            printElmt(name, depth)
            depth++
        case xml.EndElement:
            depth--
            elmt := xml.EndElement(t)
            name := elmt.Name.Local
            printElmt(name, depth)
        case xml.CharData:
            bytes := xml.CharData(t)
            printElmt("\""+string([]byte(bytes))+"\"", depth)
        case xml.Comment:
            printElmt("Comment", depth)
        case xml.ProcInst:
            printElmt("ProcInst", depth)
        case xml.Directive:
            printElmt("Directive", depth)
        default:
            fmt.Println("Unknown")
        }
    }
}

// printElmt prints s indented according to depth
func printElmt(s string, depth int) {
    for n := 0; n < depth; n++ {
        fmt.Print(" ")
    }
    fmt.Println(s)
}

func checkError(err os.Error) {
    if err != nil {
        fmt.Println("Fatal error ", err.String())
        os.Exit(1)
    }
}
Note that the parser includes all CharData, including the
whitespace between tags.
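A program that is only interested in text content will usually want to skip the CharData tokens that hold nothing but whitespace. The following is a minimal sketch of that idea, assuming a current Go toolchain (package encoding/xml with xml.NewDecoder); the helper name outline and the two-space indent are choices made for this example.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// outline (a hypothetical helper) returns the tree structure of doc,
// one line per tag or text node, dropping whitespace-only CharData.
func outline(doc string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var out strings.Builder
	depth := 0
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return out.String(), err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			out.WriteString(strings.Repeat("  ", depth) + t.Name.Local + "\n")
			depth++
		case xml.EndElement:
			depth--
			out.WriteString(strings.Repeat("  ", depth) + t.Name.Local + "\n")
		case xml.CharData:
			text := bytes.TrimSpace(t)
			if len(text) == 0 {
				continue // only whitespace between tags
			}
			out.WriteString(strings.Repeat("  ", depth) + `"` + string(text) + `"` + "\n")
		}
	}
}

func main() {
	doc := "<person>\n <name>\n  <family> Newmarch </family>\n </name>\n</person>"
	s, err := outline(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(s)
}
```

Trimming also removes the spaces around " Newmarch ", which may or may not be wanted; the sketch treats leading and trailing whitespace as insignificant.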
There is a potential trap in using this parser. It re-uses space
for strings, so that once you see a token you need to copy its
value if you want to refer to it later. Go has methods such as
func (c CharData) Copy() CharData
to make a copy of data.
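Keeping tokens beyond the next call to Token can be sketched as follows. This assumes a current Go toolchain, where the parser is xml.NewDecoder from encoding/xml and CharData still has a Copy method; the helper name collectText is made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// collectText (a hypothetical helper) keeps every CharData token in doc.
// Each token is copied before the next call to Token, because the bytes
// of the original may be overwritten by the decoder.
func collectText(doc string) ([]xml.CharData, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var kept []xml.CharData
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return kept, nil
		}
		if err != nil {
			return kept, err
		}
		if cd, ok := tok.(xml.CharData); ok {
			kept = append(kept, cd.Copy())
		}
	}
}

func main() {
	texts, err := collectText("<list><item>one</item><item>two</item></list>")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range texts {
		fmt.Println(string(t))
	}
}
```

Converting a CharData to a string also makes a copy, so code that stores strings rather than byte slices does not need the explicit Copy call.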
Go provides a function Unmarshal and a method
func (*Parser) Unmarshal
to unmarshal XML into Go data structures. The unmarshalling is not perfect:
Go and XML are different languages.
We consider a simple example before looking at the details.
We take the XML document given earlier:
<person>
    <name>
        <family> Newmarch </family>
        <personal> Jan </personal>
    </name>
    <email type="personal">
        jan@newmarch.name
    </email>
    <email type="work">
        j.newmarch@boxhill.edu.au
    </email>
</person>
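This can map onto the Go structures
type Person struct {
    Name  Name
    Email []Email
}

type Name struct {
    Family   string
    Personal string
}

type Email struct {
    Type    string "attr"
    Address string "chardata"
}
This requires several comments:
Unmarshalling fills in the fields of a structure by name, so the fields must be exported, that is, start with a capital letter. The "name" tag maps to the field Name.
Repeated tags, such as the two "email" tags, map to a slice in Go.
An attribute within a tag matches a field in a structure only if the Go field has the tag attr. This occurs with the field Type of Email, matching the attribute "type" of the "email" tag.
If an XML tag has no attributes and only has character data, then it matches a string field by the same name (case-insensitive, though). So the tag "family" with character data "Newmarch" maps to the string field Family.
If the tag has attributes, then it must map to a structure. Go assigns the character data to the field with tag chardata. This occurs with the "email" data and the field Address with tag chardata.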
A program to unmarshal the document above is
/* Unmarshal
 */
package main

import (
    "fmt"
    "os"
    "strings"
    "xml"
)

type Person struct {
    Name  Name
    Email []Email
}

type Name struct {
    Family   string
    Personal string
}

type Email struct {
    Type    string "attr"
    Address string "chardata"
}

func main() {
    str := `
<person>
<name>
<family> Newmarch </family>
<personal> Jan </personal>
</name>
<email type="personal">
jan@newmarch.name
</email>
<email type="work">
j.newmarch@boxhill.edu.au
</email>
</person>`
    r := strings.NewReader(str)
    var person Person
    err := xml.Unmarshal(r, &person)
    checkError(err)

    // now use the person structure e.g.
    fmt.Println("\"" + person.Name.Family + "\"")
    fmt.Println("\"" + person.Email[1].Address + "\"")
}

func checkError(err os.Error) {
    if err != nil {
        fmt.Println("Fatal error ", err.String())
        os.Exit(1)
    }
}
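For comparison, the same mapping can be sketched against a current Go toolchain. The assumptions here are those of the present-day encoding/xml package rather than the version used above: Unmarshal takes a byte slice, field tags are written as `xml:"type,attr"` and `xml:",chardata"`, and element names are matched case-sensitively, so each field names its tag explicitly. The helper name parsePerson is made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Person struct {
	XMLName xml.Name `xml:"person"` // name of the root element
	Name    Name     `xml:"name"`
	Email   []Email  `xml:"email"` // repeated tags fill a slice
}

type Name struct {
	Family   string `xml:"family"`
	Personal string `xml:"personal"`
}

type Email struct {
	Type    string `xml:"type,attr"` // the "type" attribute
	Address string `xml:",chardata"` // the text inside the tag
}

// parsePerson (a hypothetical helper) unmarshals doc into a Person.
func parsePerson(doc string) (Person, error) {
	var p Person
	err := xml.Unmarshal([]byte(doc), &p)
	return p, err
}

func main() {
	doc := `
<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">jan@newmarch.name</email>
  <email type="work">j.newmarch@boxhill.edu.au</email>
</person>`
	p, err := parsePerson(doc)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	// %q shows the whitespace that is kept as part of the character data
	fmt.Printf("%q\n", p.Name.Family)
	fmt.Printf("%s: %q\n", p.Email[1].Type, p.Email[1].Address)
}
```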
The strict rules are [LATER]
At present there is no support for marshalling a Go data structure
into XML. In this section we present a simple marshalling
function that will give a basic serialisation. The result can be
unmarshalled using the Go function Unmarshal of the previous section.
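The statement above describes the Go release this chapter was written against. Current releases do ship a Marshal function in encoding/xml, driven by the same field tags as Unmarshal. The following is a minimal sketch under that assumption; the helper name toXML is made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Person struct {
	XMLName xml.Name `xml:"person"` // sets the name of the root element
	Name    Name     `xml:"name"`
	Email   []Email  `xml:"email"`
}

type Name struct {
	Family   string `xml:"family"`
	Personal string `xml:"personal"`
}

type Email struct {
	Kind    string `xml:"type,attr"`
	Address string `xml:",chardata"`
}

// toXML (a hypothetical helper) marshals p with the standard library.
func toXML(p Person) (string, error) {
	b, err := xml.Marshal(p)
	return string(b), err
}

func main() {
	person := Person{
		Name: Name{Family: "Newmarch", Personal: "Jan"},
		Email: []Email{
			{Kind: "home", Address: "jan"},
			{Kind: "work", Address: "jan"},
		},
	}
	s, err := toXML(person)
	if err != nil {
		fmt.Println("Fatal error", err)
		return
	}
	fmt.Println(s)
}
```

The rest of this section still serves as an introduction to reflection, which is the technique such library functions are built on.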
A straightforward but naive approach would be to write code that walks over your data structures, printing out results as it goes. But if it is customised to your data types, then you will need to change the code each time the types change.
A better approach, and one that is used by the Go serialisation
libraries, is to use the reflection package.
This is a package that allows you to examine data types and
data structures from within a running program. The idea of
reflection has been present in artificial intelligence
programming for many years, but is still seen as a rather arcane
technique for mainstream languages.
Go has two principal reflection types: reflect.Type gives
information about the Go types, while reflect.Value gives
information about a particular data value. Value has a method
Type() that can return the type.
The simplest types and values correspond to primitive types.
For example, there is IntType, BoolType etc, which can be used
as values in type switches to determine the precise type of a
Type. The corresponding value types are IntValue and BoolValue,
with methods such as Get to return the value.
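In current Go releases the per-type structures such as IntType and IntValue no longer exist; a single reflect.Value reports its Kind, and a switch on the kind takes the place of the type switch. The following is a minimal sketch under that assumption; the helper name describe and its output format are made up for illustration.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// describe (a hypothetical helper) reports the type of x followed by
// its value, for a few primitive kinds. x is assumed to be non-nil.
func describe(x interface{}) string {
	v := reflect.ValueOf(x)
	t := v.Type() // the reflect.Type belonging to this value
	switch v.Kind() {
	case reflect.Int:
		return t.String() + " " + strconv.FormatInt(v.Int(), 10)
	case reflect.Bool:
		return t.String() + " " + strconv.FormatBool(v.Bool())
	case reflect.String:
		return t.String() + " " + v.String()
	default:
		return t.String() + " (not handled)"
	}
}

func main() {
	fmt.Println(describe(42))
	fmt.Println(describe(true))
	fmt.Println(describe("Newmarch"))
	fmt.Println(describe(1.5))
}
```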
A StructType is more complex, as it has methods to access the
fields, such as NumField to return the number of fields and a
func (t *StructType) Field(i int) (f StructField)
to return the field at a given index. StructField has fields
such as Name, which holds the string value of the field's
label. This is useful for examining the type structure.
A StructValue is useful for examining the value
of fields of a data value. It has a method
func (v *StructValue) Field(i int) Value
which can be used to extract the value of each field.
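Walking the fields of a structure can be sketched as follows. This assumes a current Go toolchain, where the field methods live on reflect.Type and reflect.Value directly rather than on StructType and StructValue; the helper name fields is made up for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

type Name struct {
	Family   string
	Personal string
}

// fields (a hypothetical helper) returns "label=value" for each field
// of a struct, or nil if x is not a struct. The fields are assumed to
// be exported, since Interface panics on unexported ones.
func fields(x interface{}) []string {
	v := reflect.ValueOf(x)
	if v.Kind() != reflect.Struct {
		return nil
	}
	t := v.Type()
	var out []string
	for i := 0; i < t.NumField(); i++ {
		label := t.Field(i).Name       // from the type
		value := v.Field(i).Interface() // from the value
		out = append(out, label+"="+fmt.Sprint(value))
	}
	return out
}

func main() {
	for _, f := range fields(Name{Family: "Newmarch", Personal: "Jan"}) {
		fmt.Println(f)
	}
}
```

The type supplies the labels and the value supplies the contents, which is the split a marshaller needs: labels become tags and contents become character data.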
The reflection process is basically started by calling
NewValue on a data object, and then examining
its type and recursively walking through the values.
What we do with each value is to surround it by tags,
made of field names of the structures encountered.
There are two complexities: the first is that the initial data value will typically be a structure, and this doesn't have a field name as it is not itself part of a structure. For this starting case, we use the type name of the structure as the XML tag name.
The second complexity comes with arrays or slices. In this case we need to work through each element of the array/slice, each time repeating the field name from the enclosing structure.
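We define three functions. Marshal takes an initial data value,
prepares the XML document and creates the top-level tag from the
structure's type name. The second function, MarshalValue, walks a
value recursively, wrapping each field of a structure in a tag made
from the field name. The third, MarshalSliceValue, handles a slice
by repeating the enclosing field name for each element.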
We ignore pointers, channels, etc. We also do not produce attributes, just tags and character data. The program is
/* Marshal
 */
package main

import (
    "bytes"
    "fmt"
    "io"
    "os"
    "reflect"
)

type Person struct {
    Name  Name
    Email []Email
}

type Name struct {
    Family   string
    Personal string
}

type Email struct {
    Kind    string "attr"
    Address string "chardata"
}

func main() {
    person := Person{
        Name: Name{Family: "Newmarch", Personal: "Jan"},
        Email: []Email{Email{Kind: "home", Address: "jan"},
            Email{Kind: "work", Address: "jan"}}}

    buff := bytes.NewBuffer(nil)
    Marshal(person, buff)
    fmt.Println(buff.String())
}

func Marshal(e interface{}, w io.Writer) {
    // make it a legal XML document
    w.Write([]byte("<?xml version=\"1.1\" encoding=\"UTF-8\" ?>\n"))

    // top-level e is a value and has no structure field,
    // so use its type
    typ := reflect.Typeof(e)
    name := typ.Name()

    startTag(name, w)
    MarshalValue(reflect.NewValue(e), w)
    endTag(name, w)
}

func MarshalValue(v reflect.Value, w io.Writer) {
    t := v.Type()
    switch t := t.(type) {
    case *reflect.StructType:
        for n := 0; n < t.NumField(); n++ {
            field := t.Field(n)
            vv := v.(*reflect.StructValue)
            // special case if it is a slice
            _, ok := vv.Field(n).Type().(*reflect.SliceType)
            if ok {
                // slice
                MarshalSliceValue(field.Name,
                    vv.Field(n).(*reflect.SliceValue), w)
            } else {
                // not a slice
                startTag(field.Name, w)
                MarshalValue(vv.Field(n), w)
                endTag(field.Name, w)
            }
        }
    case *reflect.IntType, *reflect.UintType:
        // numeric values are not written by this simple version
    case *reflect.BoolType:
        // boolean values are not written by this simple version
    case *reflect.StringType:
        vv := v.(*reflect.StringValue)
        w.Write([]byte(" " + vv.Get() + "\n"))
    default:
    }
}

// MarshalSliceValue repeats the tag once for each element of the slice
func MarshalSliceValue(tag string, v *reflect.SliceValue, w io.Writer) {
    for n := 0; n < v.Len(); n++ {
        startTag(tag, w)
        MarshalValue(v.Elem(n), w)
        endTag(tag, w)
    }
}

func startTag(s string, w io.Writer) {
    w.Write([]byte("<" + s + ">\n"))
}

func endTag(s string, w io.Writer) {
    w.Write([]byte("</" + s + ">\n"))
}

func checkError(err os.Error) {
    if err != nil {
        fmt.Println("Fatal error ", err.String())
        os.Exit(1)
    }
}
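The same walk can be sketched with the reflect package of a current Go toolchain, where a switch on Kind replaces the type switch on StructType, SliceType and so on. This is a minimal sketch, not a replacement for a library marshaller: the function names marshal and writeElement are made up for illustration, the root value is assumed to be of a named struct type, and kinds other than struct, slice, string, int and bool produce an empty element.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"reflect"
)

type Name struct {
	Family   string
	Personal string
}

type Email struct {
	Kind    string
	Address string
}

type Person struct {
	Name  Name
	Email []Email
}

// marshal returns e as nested tags, using the type name of e for
// the root tag since the root has no enclosing field name.
func marshal(e interface{}) string {
	var buf bytes.Buffer
	v := reflect.ValueOf(e)
	writeElement(&buf, v.Type().Name(), v)
	return buf.String()
}

// writeElement wraps v in <tag>...</tag>. A slice produces one element
// per entry, each repeating the name of the enclosing field.
func writeElement(buf *bytes.Buffer, tag string, v reflect.Value) {
	if v.Kind() == reflect.Slice {
		for i := 0; i < v.Len(); i++ {
			writeElement(buf, tag, v.Index(i))
		}
		return
	}
	buf.WriteString("<" + tag + ">")
	switch v.Kind() {
	case reflect.Struct:
		t := v.Type()
		for i := 0; i < v.NumField(); i++ {
			writeElement(buf, t.Field(i).Name, v.Field(i))
		}
	case reflect.String:
		// Escape characters such as < and & in character data.
		// Writes to a bytes.Buffer do not fail, so the error is dropped.
		_ = xml.EscapeText(buf, []byte(v.String()))
	case reflect.Int:
		fmt.Fprint(buf, v.Int())
	case reflect.Bool:
		fmt.Fprint(buf, v.Bool())
	}
	buf.WriteString("</" + tag + ">")
}

func main() {
	person := Person{
		Name: Name{Family: "Newmarch", Personal: "Jan"},
		Email: []Email{
			{Kind: "home", Address: "jan"},
			{Kind: "work", Address: "jan"},
		},
	}
	fmt.Println(marshal(person))
}
```

As in the program above, attributes are not produced: Kind becomes a nested tag rather than an attribute of Email.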
HTML does not conform to XML syntax. It has unterminated tags such as '<br>'. XHTML is a cleanup of HTML to make it compliant with XML. Documents in XHTML can be managed using the techniques above for XML.
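The difference shows up as soon as an unterminated tag is parsed. The following is a minimal sketch, assuming a current Go toolchain, in which the decoder of encoding/xml has a Strict flag and an AutoClose list that can be set to xml.HTMLAutoClose to tolerate tags such as '<br>'. The helper name countElements is made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// countElements (a hypothetical helper) counts the start tags called
// name in doc. With strict set to false, tags listed in
// xml.HTMLAutoClose are treated as closed immediately.
func countElements(doc, name string, strict bool) (int, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	if !strict {
		d.Strict = false
		d.AutoClose = xml.HTMLAutoClose
	}
	n := 0
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if start, ok := tok.(xml.StartElement); ok && start.Name.Local == name {
			n++
		}
	}
}

func main() {
	html := "<p>one<br>two</p>"   // HTML style, unterminated tag
	xhtml := "<p>one<br/>two</p>" // XHTML style, self-closing tag

	_, err := countElements(html, "br", true)
	fmt.Println("HTML, strict:    ", err)
	n, err := countElements(html, "br", false)
	fmt.Println("HTML, lenient:   ", n, err)
	n, err = countElements(xhtml, "br", true)
	fmt.Println("XHTML, strict:   ", n, err)
}
```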