Saturday, September 14, 2013

Go Slices and the Hash API

I finished my first little program in Go months ago.  One of the mysteries was "Why does Hash.Sum take a byte slice if it doesn't write into it?"

Well, the question is broken.  Let's look at some code that attempts to hash a file's content and save the result into a file-info structure, but which doesn't actually work:
sumbuf := make([]byte, hsh.Size())
hsh.Sum(sumbuf)
fi.hash = sumbuf
What happens?  This allocates a byte slice with both len and cap equal to the output size of the underlying hash, filled with 0x00 bytes.  It then passes into hsh.Sum (which, in this particular case, was crypto/sha512) which copies the internal state, finalizes the hash, then wraps everything up with return append(in, digest[:size]...).

Since sumbuf didn't have any room (make(T, n) is equivalent to make(T, n, n)) the append didn't have anywhere to put it.  So it did what it had to do, allocating a new backing array of sufficient size, then copying both in (aka sumbuf) and the elements from the digest slice into it.  The expanded slice backed by this new array then gets returned... only to be discarded by me.  fi.hash ends up being the slice I allocated containing all zero, which made the program think all files were duplicates.  Oops!

What works?
sumbuf := make([]byte, 0, hsh.Size())
fi.hash = hsh.Sum(sumbuf)
First, the return value must be assigned: the original slice's backing array is shared, but append creates a new slice with a larger length to hold the data.  fmt.Printf's %p will show that fi.hash and sumbuf live at the same address, and they have the same capacity as well, but the len of each differs.

Second, if we don't want append to allocate a fresh array, we need a slice with enough free space after its length to hold the data it wants to copy there.  A slice with nothing in it (yet) and the capacity of the hash length is precisely the smallest thing to fit this description.

Now that we have some working code, let's reflect on the point of passing in a slice to Hash.Sum in the first place.  The goal is to avoid allocating inside Sum, if the caller has space for it.  But Sum already allocates and copies a slice: it needs to finalize the hash for Sum, without disturbing the original state so that writers can still add data.  By working in a temporary buffer on the stack and copying at the end, it doesn't make a discrete allocation on the heap, but it still needs to ask append to copy it.

Why not begin d := append(in, _____) and then work inside the caller's buffer directly?  My guess is that working in a private buffer prevents Sum from leaving partial state visible to the rest of the program.  I don't know if it is, but I would not be surprised if append is atomic from the point of view of running goroutines, and clearly Sum needs to allocate in order to be re-entrant.

No comments: