Parsing XML
Go has an XML parser which is created using NewParser
. This takes an io.Reader
as parameter and returns a pointer to Parser
. The main method of this type is Token
which returns the next token in the input stream. The token is one of the types StartElement
, EndElement
, CharData
, Comment
, ProcInst
or Directive
.
The types are
StartElement
The type StartElement
is a structure with two field types:
type StartElement struct {
Name Name
Attr []Attr
}
type Name struct {
Space, Local string
}
type Attr struct {
Name Name
Value string
}
EndElement
This is also a structure
type EndElement struct {
Name Name
}
CharData
This type represents the text content enclosed by a tag and is a simple type
type CharData []byte
Comment
Similarly for this type
type Comment []byte
ProcInst
A ProcInst represents an XML processing instruction of the form <?target inst?>
type ProcInst struct {
Target string
Inst []byte
}
Directive
A Directive represents an XML directive of the form <!text>. The bytes do not include the <! and > markers.
type Directive []byte
A program to print out the tree structure of an XML document is
/* Parse XML
*/
package main
import (
"encoding/xml"
"fmt"
"io/ioutil"
"os"
"strings"
)
func main() {
if len(os.Args) != 2 {
fmt.Println("Usage: ", os.Args[0], "file")
os.Exit(1)
}
file := os.Args[1]
bytes, err := ioutil.ReadFile(file)
checkError(err)
r := strings.NewReader(string(bytes))
parser := xml.NewDecoder(r)
depth := 0
for {
token, err := parser.Token()
if err != nil {
break
}
switch t := token.(type) {
case xml.StartElement:
elmt := xml.StartElement(t)
name := elmt.Name.Local
printElmt(name, depth)
depth++
case xml.EndElement:
depth--
elmt := xml.EndElement(t)
name := elmt.Name.Local
printElmt(name, depth)
case xml.CharData:
bytes := xml.CharData(t)
printElmt("\""+string([]byte(bytes))+"\"", depth)
case xml.Comment:
printElmt("Comment", depth)
case xml.ProcInst:
printElmt("ProcInst", depth)
case xml.Directive:
printElmt("Directive", depth)
default:
fmt.Println("Unknown")
}
}
}
func printElmt(s string, depth int) {
for n := 0; n < depth; n++ {
fmt.Print(" ")
}
fmt.Println(s)
}
func checkError(err error) {
if err != nil {
fmt.Println("Fatal error ", err.Error())
os.Exit(1)
}
}
Note that the parser includes all CharData, including the whitespace between tags.
If we run this program against the person
data structure given earlier, it produces
person
"
"
name
"
"
family
" Newmarch "
family
"
"
personal
" Jan "
personal
"
"
name
"
"
email
"
[email protected]
"
email
"
"
email
"
[email protected]
"
email
"
"
person
"
"
Note that as no DTD or other XML specification has been used, the tokenizer correctly prints out all the white space (a DTD may specify that the whitespace can be ignored, but without it that assumption cannot be made.)
There is a potential trap in using this parser. It re-uses space for strings, so that once you see a token you need to copy its value if you want to refer to it later. Go has methods such as
func (c CharData) Copy() CharData
to make a copy of data.