I've never been satisfied with the size header in heap-allocated types though, and the two bits reserved for GC pretty much forces you to use some mark-sweep variant. Integrating something like age-oriented collection with reference counting for the old generation would require an entirely new word for every block of memory. This is a prohibitive cost in functional programs.
Removing the Size Header
Reading about BIBOP techniques used in the Streamflow allocator gave me the idea of using page masking to determine the block size.
As in Streamflow, consider an allocator where each page obtained from the OS was partitioned into a list of equally-sized blocks. Each of these blocks does not need its size stored in an object header, since the size of the block is a property of the page it was allocated from, and we can easily obtain the page address by simply masking the pointer to the block. So the first block of every page is dedicated to storing the size, and all subsequent blocks are placed in the free list.
This works for structured data allocated in fixed-sized blocks, but unstructured data is dynamic in extent and can span pages. Unstructured data must also carry the GC header in the first word, even when spanning pages. However, we know that arrays and strings are the only primitive unstructured data described in ZINC, and they must now both carry their size information as part of the structure. We can thus easily identify "small" strings and arrays that could fit into a fixed-sized block.
As a possible space optimization, such "small" arrays and strings don't need to be accompanied by a size field. We can perform a simple test to distinguish "small" arrays: if the array pointer is more than one word off from a page-aligned address, it is structured data, since unstructured data that spans pages always starts on a page address + GC header size.
We now have a full 24 free bits in the block header, which we can reserve for use by the GC. 24 bits is enough to employ reference counting, or a hybrid mark-sweep for the nursery, reference counting for the old generation as in age-oriented collection. The GC to employ is now completely at the discretion of the runtime, and can even be changed while the program is running.
There is a downside of course. Streamflow allocates fixed-sized blocks for sizes ranging from 4 bytes up to 2kB. Considering the above approach uses the first block to store the block size, we would waste up to 2kB on the larger block sizes. We could amortize this cost by allocating more pages and spreading the cost across them, but then we lose the simple page masking. We could avoid large fixed-sized blocks, maybe cutting them off at 512 bytes, but then we would still need some way to store the size for these "medium-sized" blocks.
We can't simply use a single word of the page to store the size, as we would then have an odd number of bytes to divide into fixed size blocks. Again, it's only a problem for larger block sizes. You can't split 4092 bytes into two 2kB blocks.
As a general solution to these problems, we can eliminate the size tag on the page entirely by maintaining a sorted array of address ranges and the block size of the range. When we need to discover the size of a block, we perform a binary search to find the address range the address falls in, and extract the size. This operation is O(log n), where n is the number of distinct address ranges, ie. sequences of consecutive pages allocated to the same block size. I would expect the number of such page regions to be under 50, so the logarithmic cost should be very low in practice, though the additional indirections accessing the array can be costly due to cache misses.
A block header representation that provides for GC independence would be compelling. The page-level size header is promising, but clearly has problems. Hopefully, some clever person will devise a general way to handle the larger fixed sized blocks without wasting too much space.
The solution to the problems with the page-level size header using a binary search is a mixed blessing. The additional runtime cost in discovering block size may be prohibitive, since it must be performed during GC traversal, and on every free operation. The costs of free might possibly be amortized somehow when done in bulk, but the cost incurred by GC traversal seems much more challenging to eliminate.