This document is a description of our finallisted implementation for SIGMOD 2009 Programming Contest. See the contest results at here.
Our solutions are based on multiversion concurrency control, thus we can support lock-free reads. We implemented two different index structures, the trie version for the first stage, and CSB+ tree-like version for the second stage. Even though we tried several optimizations for the CSB+ tree-like implementation, we could not beat our simple trie version in the evaluation setting of the contest.
In stage 1, our target is to achieve high performance through optimistic concurrency control and simple index structure.
We investigated several different index structures and finally decided to use trie. The decision of using trie as index is based on its simplicity and the way how the benchmark test data is generated. For string keys each letter is used to locate the corresponding child of each trie node; for integer keys every 6 bits from high to low are used to index the child. A list of different payloads is maintained for each distinct key. The keys of an index are also linked in order using a double linked list to support efficient getNext operation.
Following graph illustrates the structure of our trie index:
It is interesting that team DEXTER also adopted the trie structure, but a better design with fixed height
We used multi-version in our implementation such that transactions reading old versions of the database are not affected by updating transactions, even if the updates have been committed. A transaction acquires the latest available version number when it begins, and consistently reads/modifies that version in all later operations. Multi-version is realized using revision lists, where each revision in the list keeps track of the version number of the update that inserts or deletes the record. In our implementation, we did not consider garbage collection for revision lists because the memory consumption is acceptable for the contest setting, however it is easy to implement a GC mechenism without compromising the performance.
Our design does not require read-only trasactions to acquire locks(except for when a thread opens an index, where the unordered_map that keeps all the indices needs to be protected). Two properties make this possible:
Several data structures can be made lock-free for read-only threads given these properties, for example, a linked list can be prepanded by preparing the new node, putting up a memory barrier, and then atomically putting the new node onto the list. With multi-version, no matter a reader sees or does not see recent updates, it won't be affected.
When a transaction tries to update an index, it acquires an exclusive lock (until commit or abort) on the index. All updates to the index will be using a temporary version number, so that other trasacions cannot see it. The transaction also keeps the list of requested updates. When it aborts, all updates will be abandoned by modifying their version numbers; when it commits, the transaction acquires locks on all open indices, gets the next permenent version and replaces the temporary version numbers with the permenent one.
In stage 2, we spot on the cache line alignment optimization. The previous trie design is not very friendly to such optimization, thus we decided shifting to CSB+ tree like structure. We believe we made a mistake in this decision, because CSB+ tree is way much more complex than trie, and it was too risky to make such dramatic change to the implementation in limited time without assuring the performance gain.
Within each node of the tree, the keys are organized as a single linked list so that reads can be lock-free in presence of insertions. Node splits is realized by copying/splitting the old node into two separate new nodes, while leaving the old node unchanged to guarantee the transparency to other reads. Deletions are implemented in the same way using multi-version mechanism. Unfortunately it turns out the performance is worse than the trie implementation.