Wednesday, April 4, 2012

Immutable Hash Array Mapped Trie in C#

I just completed an implementation of an immutable hash array mapped trie (HAMT) in C#. The HAMT is an ingenious hash tree first described by Phil Bagwell. It's used in many different domains because of its time and space efficiency, although only some languages use the immutable variant. For instance, Clojure uses immutable HAMTs to implement arrays/vectors which are essential to its concurrency.

The linked implementation is pretty much the bare minimum supporting add, remove and lookup operations, so if you're interested in learning more about it, it's a good starting point. Many thanks also to krukow's fine article which helped me quickly grasp the bit-twiddling needed for the HAMT. The tree interface is basically this:

/// <summary>
/// An immutable hash-array mapped trie.
/// </summary>
/// <typeparam name="K">The type of keys.</typeparam>
/// <typeparam name="T">The type of values.</typeparam>
public class Tree<K, T> : IEnumerable<KeyValuePair<K, T>>
{
    /// <summary>
    /// The number of elements in the tree.
    /// </summary>
    public virtual int Count { get; }

    /// <summary>
    /// Find the value for the given key.
    /// </summary>
    /// <param name="key">The key to lookup.</param>
    /// <returns>
    /// The value corresponding to <paramref name="key"/>.
    /// </returns>
    /// <exception cref="KeyNotFoundException">
    /// Thrown if the key is not found in this tree.
    /// </exception>
    public T this[K key] { get; }

    /// <summary>
    /// Add the given key-value pair to the tree.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="value">The value for the given key.</param>
    /// <returns>A tree containing the key-value pair.</returns>
    public Tree<K, T> Add(K key, T value);

    /// <summary>
    /// Remove the element with the given key.
    /// </summary>
    /// <param name="key">The key to remove.</param>
    /// <returns>
    /// A tree without the value corresponding to
    /// <paramref name="key"/>.
    /// </returns>
    public Tree<K, T> Remove(K key);
}

No benchmarks yet, it's still early stages. The implementation is based on a few functions from my Sasa class library, primarily some bit-twiddling functions from Sasa.Binary.

The whole implementation is literally 200 lines of code, excluding comments. The only deficiency of the current implementation is that it doesn't properly handle hash collisions. A simple linear chain on collision would be a simple extension. I also need an efficient tree merge operation. I was initially implementing Okasaki's Patricia trees because of their efficient merge, but HAMTs are just so much better in every other way. If anyone has any pointers to efficient merging for HAMTs, I'd be much obliged!

State of Sasa v0.9.4

Sasa itself is currently undergoing an aggressive reorganization in preparation for the 0.9.4 release. A lot of the optional abstractions are moving from Sasa core into their own assemblies. A lot of the useful abstractions are relatively stand-alone. It currently stands as follows, with dependencies listed between []:

Production Ready

  • Sasa [standalone]: tuples, option types that work with both types and structs, string extensions, IEnumerable extensions, thread and null-safe event extensions, type-safe enum extensions, lightweight type-safe wrappers for some system classes, eg. WeakReference and Delegate, extensions for code generation and debugging, and generic number extensions. The goal is to provide only essential extensions to address deficiencies in the system class libraries.
  • Sasa.Binary [standalone]: low-level bit twiddling functions, endian conversions, and portable BinaryReader and BinaryWriter.
  • Sasa.Collections [Sasa, Sasa.Binary]: efficient immutable collections library, including purely functional stacks, queues, lists, and trees. Tree needs some more testing obviously, since it's a rather new addition.
  • Sasa.Mime [standalone]: a simple library encapsulating standard media types and file extensions. It also provides an interface for extending these associations at runtime.
  • Sasa.Statistics [standalone]: a few basic numerical calculations, like standard deviation, and Pierce's criterion used to remove outliers from a data set.
  • Sasa.Net [Sasa, Sasa.Collections]: MIME mail message parsing to System.Net.Mail.MailMessage (most libraries provide unfamiliar, custom mail and attachment classes), a POP3 client, and RFC822 header parsing.
  • Sasa.Contracts [Sasa]: I've used the runtime preconditions subset of the standard .NET contracts for years. I haven't gotten around to adding postconditions and invariants support to ilrewriter.
  • ilrewriter.exe [Sasa]: the IL rewriter currently only erases Sasa.TypeConstraint<T> from your code, which allows you to specify type constraints that C# normally disallows, ie. T : Delegate, or T : Enum.

Beta

These abstractions work, but haven't seen the production use or stress testing the above classes have.

  • Sasa.TM [Sasa]: software transactional memory, super-fast thread-local data (much faster than ThreadLocal<T>!).
  • Sasa.Reactive [Sasa]: building on Rx.NET, this provides Property<T> which is a mutable, reactive cell with a getter/setter. Any changes automatically propagate to observers. NamedProperty<T> inherits from Property<T> and further implements INotifyPropertyChanged and INotifyPropertyChanging.
  • Sasa.Parsing [Sasa]: implements a simple, extensible Pratt parser. Grammars are generic and can be extended via standard inheritance. The test suite is extensive, although I've only use this in private projects, not production code.
  • Sasa.Linq [standalone]: base classes for LINQ expression visitors and query providers. Not too uncommon these days, but I've had them in Sasa for many years.

Currently Broken

These assemblies are undergoing some major refactoring, and are currently rather broken.

  • Sasa.Dynamics [Sasa, Sasa.Collections]: blazingly fast, type-safe runtime reflection. This code underwent significant refactoring, and I recently realized that the patterns being used here could be abstracted even further by providing multiple dispatch for .NET. See the multiple-dispatch branch of the repository for the current status of that work. This should be complete for Sasa v0.9.4.
  • Sasa.Serialization [Sasa, Sasa.Dynamics]: a compact, fast serializer based on Sasa.Dynamics. Waiting on the completion of Sasa.Dynamics.

Deprecated

These assemblies are now deprecated, either because they saw little use, were overly complex, or better alternatives now exist.

  • Sasa.Concurrency [Sasa]: primarily an overly complex implementation of futures based on Alice ML semantics. In a future release, Sasa.Concurrency will strip out futures, absorb Sasa.TM, and also provide a deterministic concurrency library based on concurrent revision control, which I believe to be inherently superior to STM.

Toy

These assemblies are not really meant for serious use, primarily because they don't fit with standard .NET idioms.

  • Sasa.FP [Sasa, Sasa.Collections]: some less useful functional idioms, like delegate currying and tupling, binomial trees, trivial immutable sets, and either types.
  • Sasa.Arrow [Sasa]: a convenient interface for arrows. Definitely not a idiomatic .NET!
  • Sasa.Linq.Expressions [Sasa]: extensions to compose LINQ expressions. Also provides some typed expression trees, as opposed to the standard untyped ones. In theory it should work, and the code all type checks, but pretty much 0 testing at the moment.