HTML

The Web was originally created to serve HTML documents. Now it is used to serve all sorts of documents, as well as data of different kinds. Nevertheless, HTML is still the main document type delivered over the Web. Go has basic mechanisms for parsing HTML documents, which are covered in this chapter.

Introduction

HTML has been through a large number of versions. HTML5 became a W3C Recommendation in 2014, and the language is now maintained as the WHATWG HTML Living Standard. There have also been many "vendor" versions of HTML, introducing tags that never made it into standards.

HTML is simple enough to be edited by hand. Consequently, many HTML documents are "ill-formed", not following the syntax of the language. HTML parsers are generally not very strict, and will accept many "illegal" documents.

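Go supports HTML parsing through a tokenizer, which in current Go lives in the supplementary package golang.org/x/net/html rather than in the standard library. The tokenizer lets you process an HTML document as it is read. If you want a parse tree, you can build one yourself from the token stream, or use the Parse function from the same package.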
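Building a parse tree from a token stream usually comes down to keeping a stack of open elements. The following is a minimal sketch of that idea, not the html package's own method: the token and node types and the buildTree and dump functions are hypothetical names invented for illustration, and a real program would take its tokens from the tokenizer rather than from a hand-written slice.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenKind and token are hypothetical stand-ins for whatever a real
// tokenizer emits; they are not part of any Go package.
type tokenKind int

const (
	startTag tokenKind = iota
	endTag
	text
)

type token struct {
	kind tokenKind
	data string // tag name, or the text itself
}

// node is one element or text run in the tree.
type node struct {
	data     string
	children []*node
}

// buildTree folds a token stream into a tree using a stack of open elements.
// Unmatched end tags are ignored, and elements still open at the end are
// closed implicitly, in the forgiving spirit of HTML parsers.
func buildTree(tokens []token) *node {
	root := &node{data: "#root"}
	stack := []*node{root}
	for _, t := range tokens {
		top := stack[len(stack)-1]
		switch t.kind {
		case startTag:
			n := &node{data: t.data}
			top.children = append(top.children, n)
			stack = append(stack, n)
		case text:
			top.children = append(top.children, &node{data: t.data})
		case endTag:
			// pop back to the nearest open element with this name, if any
			for i := len(stack) - 1; i > 0; i-- {
				if stack[i].data == t.data {
					stack = stack[:i]
					break
				}
			}
		}
	}
	return root
}

// dump renders the tree one node per line, indented by depth.
func dump(n *node, depth int, sb *strings.Builder) {
	sb.WriteString(strings.Repeat(" ", depth) + n.data + "\n")
	for _, c := range n.children {
		dump(c, depth+1, sb)
	}
}

func main() {
	tokens := []token{
		{startTag, "p"},
		{text, "Hello "},
		{startTag, "b"},
		{text, "world"},
		{endTag, "b"},
		{endTag, "p"},
	}
	var sb strings.Builder
	dump(buildTree(tokens), 0, &sb)
	fmt.Print(sb.String())
}
```

Because only start tags push onto the stack and an end tag pops back to its nearest match, a stray end tag cannot corrupt the tree. That is one simple way of coping with the ill-formed documents mentioned earlier.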

Tokenizer

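The html package (golang.org/x/net/html in current Go, since the tokenizer is not part of the standard library) implements a basic tokenizer that can be used to parse HTML. The following program reads a file of HTML text and prints its start tags, end tags, and text, indented by nesting depth. It stops when the tokenizer returns an error token, which is also how the end of the input is reported.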


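/* Read HTML
 */

package main

import (
	"fmt"
	"io"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Println("Usage: ", os.Args[0], "file")
		os.Exit(1)
	}
	file, err := os.Open(os.Args[1])
	checkError(err)
	defer file.Close()

	// the tokenizer reads straight from the file, one token at a time
	z := html.NewTokenizer(file)

	depth := 0
	for {
		tt := z.Next()

		switch tt {
		case html.ErrorToken:
			// io.EOF marks the normal end of input; anything else is a real error
			if err := z.Err(); err != io.EOF {
				checkError(err)
			}
			return
		case html.TextToken:
			fmt.Println(strings.Repeat(" ", depth) + z.Token().String())
		case html.StartTagToken:
			// note: void elements such as <br> have no end tag,
			// so they also deepen the indentation
			fmt.Println(strings.Repeat(" ", depth) + z.Token().String())
			depth++
		case html.EndTagToken:
			// step back out first so an end tag lines up with its start tag;
			// the guard copes with stray end tags in ill-formed documents
			if depth > 0 {
				depth--
			}
			fmt.Println(strings.Repeat(" ", depth) + z.Token().String())
		}
	}
}

func checkError(err error) {
	if err != nil {
		fmt.Println("Fatal error ", err.Error())
		os.Exit(1)
	}
}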
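
One limitation of counting depth from start and end tags, as the program above does, is that void elements such as <br> or <img> have a start tag but never an end tag, so each one pushes the indentation one level deeper. The following is a minimal sketch of one way to handle this. The names voidElements and nextDepth are hypothetical helpers, not part of the html package, and the sketch assumes tag names have already been lower-cased.

```go
package main

import "fmt"

// voidElements lists the elements the HTML standard defines as void:
// they have a start tag but no end tag and no content.
var voidElements = map[string]bool{
	"area": true, "base": true, "br": true, "col": true, "embed": true,
	"hr": true, "img": true, "input": true, "link": true, "meta": true,
	"source": true, "track": true, "wbr": true,
}

// nextDepth is a hypothetical helper: given the current depth and a tag,
// it returns the depth to use for whatever follows that tag.
func nextDepth(depth int, name string, isEnd bool) int {
	switch {
	case voidElements[name]:
		return depth // nothing to open or close
	case isEnd:
		if depth > 0 {
			return depth - 1
		}
		return 0 // stray end tag in an ill-formed document
	default:
		return depth + 1
	}
}

func main() {
	depth := 0
	tags := []struct {
		name  string
		isEnd bool
	}{{"p", false}, {"br", false}, {"img", false}, {"p", true}}
	for _, tag := range tags {
		depth = nextDepth(depth, tag.name, tag.isEnd)
		fmt.Println(tag.name, depth)
	}
}
```

In this run the depth rises to 1 after the opening p, stays at 1 across br and img, and returns to 0 at the closing p.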