UTF-16 and Go

UTF-16 deals with arrays of short 16-bit unsigned integers. The package utf16 is designed to manage such arrays. To convert a normal Go string, that is a UTF-8 string, into UTF-16, you first extract the code points by coercing it into a []rune and then use utf16.Encode to produce an array of type uint16.

Similarly, to decode an array of unsigned short UTF-16 values into a Go string, you use utf16.Decode to convert it into code points as type []rune and then to a string. The following code fragment illustrates this

str := "百度一下,你就知道"

runes := utf16.Encode([]rune(str))
ints := utf16.Decode(runes)

str = string(ints)

These type conversions need to be applied by clients or servers as appropriate, to read and write 16-bit short integers, as shown below.

Little-endian and big-endian

Unfortunately, there is a little devil lurking behind UTF-16. It is basically an encoding of characters into 16-bit short integers. The big question is: for each short, how is it written as two bytes? The top one first, or the top one second? Either way is fine, as long as the receiver uses the same convention as the sender.

Unicode has addressed this with a special character known as the BOM (byte order marker). This is a zero-width non-printing character, so you never see it in text. But its value 0xfffe is chosen so that you can tell the byte-order:

  • In a big-endian system it is FF FE
  • In a little-endian system it is FE FF

Text will sometimes place the BOM as the first character in the text. The reader can then examine these two bytes to determine what endian-ness has been used.

UTF-16 client and server

Using the BOM convention, we can write a server that prepends a BOM and writes a string in UTF-16 as

/* UTF16 Server
 */
package main

import (
    "fmt"
    "net"
    "os"
    "unicode/utf16"
)

const BOM = '\ufffe'

func main() {

    service := "0.0.0.0:1210"
    tcpAddr, err := net.ResolveTCPAddr("tcp", service)
    checkError(err)

    listener, err := net.ListenTCP("tcp", tcpAddr)
    checkError(err)

    for {
        conn, err := listener.Accept()
        if err != nil {
            continue
        }

        str := "j'ai arrêté"
        shorts := utf16.Encode([]rune(str))
        writeShorts(conn, shorts)

        conn.Close() // we're finished
    }
}

func writeShorts(conn net.Conn, shorts []uint16) {
    var bytes [2]byte

    // send the BOM as first two bytes
    bytes[0] = BOM >> 8
    bytes[1] = BOM & 255
    _, err := conn.Write(bytes[0:])
    if err != nil {
        return
    }

    for _, v := range shorts {
        bytes[0] = byte(v >> 8)
        bytes[1] = byte(v & 255)

        _, err = conn.Write(bytes[0:])
        if err != nil {
            return
        }
    }
}

func checkError(err error) {
    if err != nil {
        fmt.Println("Fatal error ", err.Error())
        os.Exit(1)
    }
}

while a client that reads a byte stream, extracts and examines the BOM and then decodes the rest of the stream is

/* UTF16 Client
 */
package main

import (
    "fmt"
    "net"
    "os"
    "unicode/utf16"
)

const BOM = '\ufffe'

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: ", os.Args[0], "host:port")
        os.Exit(1)
    }
    service := os.Args[1]

    conn, err := net.Dial("tcp", service)
    checkError(err)

    shorts := readShorts(conn)
    ints := utf16.Decode(shorts)
    str := string(ints)

    fmt.Println(str)

    os.Exit(0)
}

func readShorts(conn net.Conn) []uint16 {
    var buf [512]byte

    // read everything into the buffer
    n, err := conn.Read(buf[0:2])
    for true {
        m, err := conn.Read(buf[n:])
        if m == 0 || err != nil {
            break
        }
        n += m
    }

    checkError(err)
    var shorts []uint16
    shorts = make([]uint16, n/2)

    if buf[0] == 0xff && buf[1] == 0xfe {
        // big endian
        for i := 2; i < n; i += 2 {
            shorts[i/2] = uint16(buf[i])<<8 + uint16(buf[i+1])
        }
    } else if buf[1] == 0xfe && buf[0] == 0xff {
        // little endian
        for i := 2; i < n; i += 2 {
            shorts[i/2] = uint16(buf[i+1])<<8 + uint16(buf[i])
        }
    } else {
        // unknown byte order
        fmt.Println("Unknown order")
    }
    return shorts

}

func checkError(err error) {
    if err != nil {
        fmt.Println("Fatal error ", err.Error())
        os.Exit(1)
    }
}

results matching ""

    No results matching ""