The Web was originally created to serve HTML documents. Now it is used to serve all sorts of documents as well as data of different kinds. Nevertheless, HTML is still the main document type delivered over the Web. Go has basic mechanisms for parsing HTML documents, which are covered in this chapter.
HTML has been through a large number of versions, and HTML 5 is currently under development. There have also been many "vendor" versions of HTML, introducing tags that never made it into standards.
HTML is simple enough to be edited by hand. Consequently, many HTML documents are "ill formed", not following the syntax of the language. HTML parsers generally are not very strict, and will accept many "illegal" documents.
Go has basic parsing mechanisms based on a tokeniser. This allows you to process HTML documents as they are read, but if you want to, say, build a parse tree, then you have to do that yourself.
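To make "build a parse tree yourself" concrete, here is a minimal sketch of folding a stream of tokens into a tree with a stack of open elements. The Kind, Token and Node types and the BuildTree and Dump functions are hypothetical names invented for this illustration; they stand in for whatever a real tokenizer emits so that the example runs with the standard library alone. It makes no attempt to follow the full HTML parsing rules.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind and Token are hypothetical stand-ins for a tokenizer's output.
type Kind int

const (
	StartTag Kind = iota
	EndTag
	Text
)

type Token struct {
	Kind Kind
	Data string // tag name, or the text itself
}

// Node is one element or text node in the tree.
type Node struct {
	Data     string
	IsText   bool
	Children []*Node
}

// BuildTree folds a token stream into a tree, keeping a stack of open elements.
func BuildTree(tokens []Token) *Node {
	root := &Node{Data: "#root"}
	stack := []*Node{root}
	for _, t := range tokens {
		top := stack[len(stack)-1]
		switch t.Kind {
		case StartTag:
			n := &Node{Data: t.Data}
			top.Children = append(top.Children, n)
			stack = append(stack, n)
		case Text:
			top.Children = append(top.Children, &Node{Data: t.Data, IsText: true})
		case EndTag:
			// Pop back to the nearest matching open element.
			// A stray end tag with no match is ignored.
			for i := len(stack) - 1; i > 0; i-- {
				if stack[i].Data == t.Data {
					stack = stack[:i]
					break
				}
			}
		}
	}
	return root
}

// Dump renders the tree as an indented outline, two spaces per level.
func Dump(n *Node, depth int) string {
	s := strings.Repeat("  ", depth) + n.Data + "\n"
	for _, c := range n.Children {
		s += Dump(c, depth+1)
	}
	return s
}

func main() {
	tokens := []Token{
		{StartTag, "html"}, {StartTag, "body"},
		{StartTag, "p"}, {Text, "Hello"}, {EndTag, "p"},
		{EndTag, "body"}, {EndTag, "html"},
	}
	fmt.Print(Dump(BuildTree(tokens), 0))
}
```

Popping back to the nearest matching open element, rather than insisting on strict nesting, is one simple way to tolerate the ill-formed documents described above: an unclosed inner element is closed implicitly when its parent's end tag arrives.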
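The html package implements a basic tokenizer that can be used to parse HTML. (It was part of the standard library in early Go releases and is now distributed as golang.org/x/net/html.) The following program reads a file of HTML text and prints each text and tag token, indented according to how deeply it is nested. When this was written the package was incomplete, so the program halted with an error message from the Go team.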
/* Read HTML
 */
package main

import (
	"fmt"
	"io"
	"os"
	"strings"

	// the tokenizer now lives outside the standard library
	"golang.org/x/net/html"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Println("Usage: ", os.Args[0], "file")
		os.Exit(1)
	}
	file := os.Args[1]
	bytes, err := os.ReadFile(file)
	checkError(err)
	r := strings.NewReader(string(bytes))
	z := html.NewTokenizer(r)

	depth := 0
	for {
		tt := z.Next()
		switch tt {
		case html.ErrorToken:
			// io.EOF marks the normal end of input; anything else is a real error
			if z.Err() != io.EOF {
				checkError(z.Err())
			}
			os.Exit(0)
		case html.TextToken:
			indent(depth)
			fmt.Println(z.Token().String())
		case html.StartTagToken:
			indent(depth)
			fmt.Println(z.Token().String())
			depth++
		case html.EndTagToken:
			// step back out first so the end tag lines up with its start tag
			if depth > 0 {
				depth--
			}
			indent(depth)
			fmt.Println(z.Token().String())
		}
	}
}

// indent prints one space per level of nesting
func indent(depth int) {
	fmt.Print(strings.Repeat(" ", depth))
}

func checkError(err error) {
	if err != nil {
		fmt.Println("Fatal error ", err.Error())
		os.Exit(1)
	}
}
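The program above increases the depth on every start tag, so elements that never have an end tag (br, img, meta and so on) push the indentation steadily to the right. The following is a minimal sketch of one way to compensate. The Tag type and the Depths function are hypothetical names made up for this illustration, and the list of void elements is my reading of the current HTML standard, so check it against the specification before relying on it.

```go
package main

import "fmt"

// voidElements lists HTML elements that never have an end tag
// (an assumption taken from the HTML standard's list of void elements).
var voidElements = map[string]bool{
	"area": true, "base": true, "br": true, "col": true, "embed": true,
	"hr": true, "img": true, "input": true, "link": true, "meta": true,
	"source": true, "track": true, "wbr": true,
}

// Tag is a hypothetical, simplified view of a start or end tag token.
type Tag struct {
	Name string
	End  bool
}

// Depths returns the depth at which each tag should be printed.
func Depths(tags []Tag) []int {
	depths := make([]int, 0, len(tags))
	depth := 0
	for _, t := range tags {
		switch {
		case t.End:
			// An end tag is printed at the depth of its start tag.
			if depth > 0 {
				depth--
			}
			depths = append(depths, depth)
		case voidElements[t.Name]:
			// A void element has no content, so the depth is unchanged.
			depths = append(depths, depth)
		default:
			depths = append(depths, depth)
			depth++
		}
	}
	return depths
}

func main() {
	tags := []Tag{
		{"body", false}, {"p", false}, {"br", false}, {"p", true}, {"body", true},
	}
	fmt.Println(Depths(tags)) // prints [0 1 2 1 0]
}
```

The same check could be applied inside the token loop by looking up the tag name before incrementing the depth. Documents that omit optional end tags (for example an unclosed p) would still drift, which is one reason a full parser tracks open elements rather than a single counter.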