How do you ensure mutations to your data store are always detectable - that is, auditable - for all parties? For us, existing solutions had two problems. They weren't designed to handle large volumes of data, and their security ultimately rests on exceptionality: only if you're, say, the maintainer of the Linux kernel source will the whole world be waiting to witness your latest tree head. AuditHub takes these issues head-on.

Data-witnessing on hash chains

AuditHub starts from the age-old cryptographic hash chain (or just “chain”, from here on): a simple sequence of blocks, where each block contains a hash reference to the block that precedes it. In our case, each block on the chain represents timestamped data: it carries hash references to user data (see “Asset Hashes” below), and because it is impossible to predict a hash, its presence proves that the user data existed at the time it was incorporated into the chain. This is a concept known as "linked time-stamping" (see various papers by Haber & Stornetta).
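As a minimal sketch (the field names are illustrative, not AuditHub's actual block format), such a chain of timestamped blocks can be built like this:

```python
import hashlib
import json
import time

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def make_block(prev_hash: str, asset_hashes: list[str], ts: float) -> dict:
    # A block commits to its predecessor's hash and to the user data
    # (via asset hashes) that existed at the time of its creation.
    body = {"prev": prev_hash, "assets": asset_hashes, "time": ts}
    return {**body, "hash": sha256_hex(json.dumps(body, sort_keys=True).encode())}

# The genesis block has no predecessor.
genesis = make_block("", [sha256_hex(b"report v1")], time.time())
second = make_block(genesis["hash"], [sha256_hex(b"report v2")], time.time())

# Because hashes cannot be predicted, the presence of the hash of
# "report v1" in the chain proves the data existed when it was recorded.
assert second["prev"] == genesis["hash"]
```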

Hash-chaining ensures that new blocks can only be appended at the end: changing anything anywhere else would invalidate the hash references to older blocks. Thus, there is only one important rule in this game: an honest chain operator must never remove blocks. We need observers to continuously monitor that the chain is operated accordingly.
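A sketch, using an illustrative block format, of what such an observer checks: re-deriving every block hash and following every predecessor reference detects any removed or rewritten block.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def block_hash(body: dict) -> str:
    return sha256_hex(json.dumps(body, sort_keys=True).encode())

def make_block(prev_hash: str, payload: str) -> dict:
    body = {"prev": prev_hash, "payload": payload}
    return {**body, "hash": block_hash(body)}

def verify_chain(blocks: list[dict]) -> bool:
    # Each block must hash to its stated hash, and each block after the
    # first must reference its predecessor's hash.
    for i, blk in enumerate(blocks):
        body = {k: v for k, v in blk.items() if k != "hash"}
        if block_hash(body) != blk["hash"]:
            return False
        if i > 0 and blk["prev"] != blocks[i - 1]["hash"]:
            return False
    return True

a = make_block("", "first")
b = make_block(a["hash"], "second")
c = make_block(b["hash"], "third")

assert verify_chain([a, b, c])
# Removing a block (a rollback) breaks the reference from its successor:
assert not verify_chain([a, c])
```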

Now, for reasons of flexibility and scalability, we ought to be able to have many such chains: so that users don't have to share chains, so that users can have data from different sources witnessed separately, and so that a higher overall throughput is easier to accomplish. The problem, of course, with operating chains all over the place is that most of them would lack observers.

Witnessing chains

AuditHub solves the observer-problem by having chains carry additional hashes that reference blocks on other chains. As a result, rolling back history on any chain would create inconsistencies that are visible to other chains. In other words: each user, through their chains, also becomes an observer.
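A sketch of the idea with illustrative structures: a block on the witness chain carries the head hashes of its client chains, so rolling back a client would contradict what the witness has already committed to.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypothetical current head hashes of two client chains.
client_heads = {
    "chain-A": sha256_hex(b"chain A, block 7"),
    "chain-B": sha256_hex(b"chain B, block 3"),
}

# A block on the witness chain commits to its own predecessor AND to the
# client heads it has observed.
witness_body = {"prev": sha256_hex(b"witness block 11"), "clients": client_heads}
witness_hash = sha256_hex(json.dumps(witness_body, sort_keys=True).encode())

# If chain-A later rolled back and rewrote block 7, its published head
# would no longer match the hash recorded here, so any observer of the
# witness chain could see the inconsistency.
rewritten_head = sha256_hex(b"chain A, block 7 (rewritten)")
assert rewritten_head != witness_body["clients"]["chain-A"]
```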

Theoretically speaking, all chains in this system could be considered equal, with each being appointed one or a number of witnesses. For practical reasons, we’ll introduce a bit more hierarchy, distinguishing two types of chains: chains that primarily record asset hashes, and chains that exclusively record other chains’ state. If we think of these in terms of a tree structure, the former would be the leaf/outer nodes and the latter the branch/inner nodes. To avoid overloaded terminology (you’ll find that we have all sorts of trees in our data model), we’ll refer to the outer nodes as “user chains”, and instead of speaking of “parent/child nodes” we will use “witness chain” and “client chain”.

Now you may ask, how do we decide which chain can be a client to which witness? Could that be entirely random? The thing is, there is some complexity in chain witnessing: an auditor needs to be able to verify that a chain’s witness chains cannot provide false evidence either. We need to consider a few more things then.

Who’s your witness?

The main concern is that it should not be possible to hide a fork of your chain in the network: that is, a scenario in which you create two histories from some point onwards, sending the respective blocks to different witness chains. The purpose of this would be to show some party you’re trying to defraud a different branch of your chain than the rest of the world gets to see: whenever you’d see their IP address, you’d serve up a different series of timestamps, witnessed by a different set of witness chains.

If the network gets sufficiently large, we would not be able to detect such a situation: you cannot exhaustively search all hosts for two different timestamps that share the same parent. Therefore, it should be unambiguous which witness(es) we should query to corroborate any given timestamp. Our solution is to declare this choice as part of a preceding timestamp: if the parent timestamp insists that X is the only witness that may confirm the next timestamp, then there is no way to fork the chain at that point - you cannot have the alternative branch corroborated by an undeclared witness Y.
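A sketch of the rule, with illustrative field names: each timestamp declares which witness may confirm its successor, so an auditor can reject any branch corroborated by an undeclared witness.

```python
def accepts(parent: dict, child: dict, confirmed_by: str) -> bool:
    # An auditor accepts the child timestamp only if it extends the
    # parent AND is corroborated by the witness the parent declared.
    return (child["prev"] == parent["hash"]
            and confirmed_by == parent["next_witness"])

parent = {"hash": "abc123", "next_witness": "X"}
honest = {"prev": "abc123", "hash": "def456"}
forked = {"prev": "abc123", "hash": "999fff"}

assert accepts(parent, honest, confirmed_by="X")
# The fork cannot be hidden behind an undeclared witness Y:
assert not accepts(parent, forked, confirmed_by="Y")
```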

What about the very first timestamp on the chain? Since it has no parent, you could create multiple chains from the start! There’s a fix for that too: we needed a chain ID anyway, and rather than have it be something arbitrary like a UUID, why not have the first timestamp’s hash be the ID of the chain? This makes it all unambiguous.
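A sketch of this, with an illustrative first-timestamp format: because the chain ID is the hash of the first timestamp, two different “first” timestamps are, by definition, two different chains.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def chain_id(first_timestamp: dict) -> str:
    # Nothing arbitrary like a UUID: the ID is the first timestamp's hash.
    return sha256_hex(json.dumps(first_timestamp, sort_keys=True).encode())

ts_one = {"prev": None, "assets": ["aa11"], "next_witness": "X"}
ts_two = {"prev": None, "assets": ["bb22"], "next_witness": "X"}

# An attacker creating two "first" timestamps simply ends up with two
# distinct chain IDs - not two versions of one chain.
assert chain_id(ts_one) != chain_id(ts_two)
```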

These two measures, together, ensure there can never be an undetected hidden branch: a chain is uniquely determined by its first block, and from that point onwards it is predetermined which witness chains can vouch for the next block.

Don’t trust the witness

The previous measures, of course, still leave one option for mischief open: what if the client and witness collude? If the witness is happy to accept two versions of the client’s history, and is able to present either one at its discretion, we’re none the wiser.

So we cannot allow the witness any discretion here. A very elegant way to restrict the witness' options is offered by Laurie and Kasper, in a data structure called a Verifiable Map (in the original paper linked to here, it was referred to as a Sparse Merkle Tree). If the timestamps a witness holds for its clients are held in a Verifiable Map, any observer can verify that the witness holds only one latest timestamp for a given chain ID at any given time.
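A toy sketch of the idea (an 8-bit key space instead of a full hash-sized one, and none of the optimisations a real implementation needs): the map’s root hash commits to exactly one value per key, and a short sibling path lets any observer verify which value a given key holds.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

DEPTH = 8            # toy key space: 256 slots
EMPTY_LEAF = H(b"")  # placeholder hash for unset keys

class SparseMerkleMap:
    def __init__(self):
        self.leaves = {}  # key (int) -> leaf hash

    def set(self, key: int, value: bytes):
        # One slot per key: setting a key replaces its previous value.
        self.leaves[key] = H(value)

    def _node(self, height: int, index: int) -> bytes:
        if height == 0:
            return self.leaves.get(index, EMPTY_LEAF)
        return H(self._node(height - 1, 2 * index) +
                 self._node(height - 1, 2 * index + 1))

    def root(self) -> bytes:
        return self._node(DEPTH, 0)

    def proof(self, key: int) -> list[bytes]:
        # Sibling hashes along the path from the key's leaf to the root.
        return [self._node(h, (key >> h) ^ 1) for h in range(DEPTH)]

def verify(root: bytes, key: int, value: bytes, path: list[bytes]) -> bool:
    node = H(value)
    for h, sibling in enumerate(path):
        if (key >> h) & 1:
            node = H(sibling + node)  # our node is the right child
        else:
            node = H(node + sibling)  # our node is the left child
    return node == root

m = SparseMerkleMap()
m.set(0x2A, b"timestamp-hash-v1")
assert verify(m.root(), 0x2A, b"timestamp-hash-v1", m.proof(0x2A))

# Overwriting the key changes the root: under a single published root, the
# witness cannot present two "latest" timestamps for one chain ID.
old_root = m.root()
m.set(0x2A, b"timestamp-hash-v2")
assert m.root() != old_root
assert not verify(m.root(), 0x2A, b"timestamp-hash-v1", m.proof(0x2A))
```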

Additionally, to prevent the witness from sneakily rolling back and forth between two client histories, interleaving them (which would be detectable, but would require inspecting lots of timestamps), every update should declare both the previously held timestamp and the new one. It then takes only one hop on the witness chain to check that the map indeed held the declared previous timestamp, and one hop on the client chain to check that this previous timestamp is an ancestor of the new one. To this end, the Verifiable Map is accompanied by a Verifiable Log, a concept described by the same authors and originally referred to simply as a Merkle Hash Tree in RFC 6962.
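A sketch of the auditor’s two hops, with illustrative structures (and with “ancestor” simplified here to the direct parent):

```python
def check_update(update: dict, witness_map_before: dict, client_blocks: dict) -> bool:
    # Hop 1, on the witness chain: the map really did hold the declared
    # previous timestamp for this chain ID.
    if witness_map_before.get(update["chain_id"]) != update["old"]:
        return False
    # Hop 2, on the client chain: the declared previous timestamp is an
    # ancestor of the new one (simplified here to the direct parent).
    new_block = client_blocks.get(update["new"])
    return new_block is not None and new_block["prev"] == update["old"]

witness_map = {"chain-A": "ts1"}          # map state before the update
client = {"ts2": {"prev": "ts1"}}         # the client chain's blocks

assert check_update(
    {"chain_id": "chain-A", "old": "ts1", "new": "ts2"}, witness_map, client)
# An update declaring a previous timestamp the map never held is rejected:
assert not check_update(
    {"chain_id": "chain-A", "old": "ts0", "new": "ts2"}, witness_map, client)
```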

Two-way anchoring

As a small but sometimes important aside, there is another benefit to cross-referencing between chains. The presence of your timestamp hash on another chain can only prove that the timestamp was generated before a particular point in time - it cannot tell you how long before. However, if that timestamp itself carries a hash generated on another chain, this proves that it must have been generated after that other point in time. This allows auditors to place timestamps in time with decent precision, even if the chain only receives new entries infrequently.
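As a worked example with made-up numbers: if a timestamp carries a hash that another chain produced at time t1, and the timestamp’s own hash appears on that chain at time t2, an auditor can bound its creation to the interval between t1 and t2.

```python
# Made-up observation times (Unix seconds) on a well-observed witness chain.
t1 = 1_700_000_000  # the hash our timestamp CARRIES appeared at t1
t2 = 1_700_000_005  # our timestamp's OWN hash was witnessed at t2

# The timestamp cannot predate the hash it carries, and cannot postdate
# its own appearance on the witness chain:
lower, upper = t1, t2
assert lower < upper
print(f"created within a {upper - lower}-second window")
```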

Another way of looking at this is that we’re really aiming to use the hashes as the ultimate time coordinates. After all, anybody could make up a plausible Unix timestamp, so chain hashes are a more meaningful positioning system.

Asset hashes

Entries on user chains carry proofs of existence for user data: we speak of “asset hashes”. These may be organised in a range of formats, so as to optimally accommodate the particular type of user data the chain handles, and the specification (and the SDKs) may add further formats in the future. For example, for streaming data a Verifiable Log may be appropriate, or just a Merkle tree holding the latest hashes only, or even a flat list of hashes (saving client-side compute power), whereas for versioned documents a Verifiable Map may work best.
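For instance, a plain Merkle tree over the latest hashes might be computed like this (a generic sketch, not AuditHub’s wire format; duplicating the last node on odd levels is one common convention):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaf_hashes: list[bytes]) -> bytes:
    if not leaf_hashes:
        return H(b"")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [H(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Hash a small stream of records and commit to all of them with one root.
leaves = [H(f"record {i}".encode()) for i in range(5)]
root = merkle_root(leaves)

# The chain entry then only needs to carry `root`; each individual record
# can later be proven against it with a logarithmic-size sibling path.
assert merkle_root(leaves) == root
```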

Tracking document versions

The latter works in the same way as the chain-witnessing scheme described earlier, but rather than mapping a chain ID to its latest timestamp hash, it maps a document ID (some agreed-upon URN) to the document’s latest hash. This provides a way for multiple parties to verify that they’ve been presented with the same version of a document at some point in time.
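A sketch of the check the parties can perform, with a hypothetical URN and an illustrative map format:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Illustrative map entry registered on a user chain at some timestamp:
# document ID (an agreed-upon URN) -> latest document hash.
doc_map = {"urn:example:contract-42": sha256_hex(b"contract, version 3")}

def same_version(doc_id: str, my_copy: bytes) -> bool:
    # Each party hashes the copy it was shown and compares it to the map.
    return doc_map.get(doc_id) == sha256_hex(my_copy)

# Both parties were shown version 3, and can verify they agree:
assert same_version("urn:example:contract-42", b"contract, version 3")
# A party shown a doctored copy detects the mismatch:
assert not same_version("urn:example:contract-42", b"contract, v3 (altered)")
```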

For one example of a potential future format: where a chain is used to monitor a file system tree, we expect that something close to the tree structure used by the git version control system (which represents directories as flat lists of hashes of their child nodes) could be very useful. In principle, the Verifiable Map as discussed could serve this use case, but it might be too computationally intensive for a large file tree.

A little gossip

Clients and chain hosts need to be able to do some "out of band" checking, to ensure that their view of the network is consistent with everyone else's. Very little information needs to be relayed: a recent timestamp from just one of the higher-tier witness chains should be sufficient.

Counterintuitively, it is mostly safe to use information presented to you in-band to find other hosts to gossip with: this is the power of the Verifiable Map at work. Only for the very first connection does one need to bootstrap the list of peers.

Use model

To recapitulate: aided by our SDKs, a user continuously hashes their data streams into an appropriate tree format. Periodically (depending on the needs of the application - say, every second) the latest tree root and changes are packaged into a timestamp. Alternatively, the hashes are submitted to our API service and all of this happens server-side. Either way, the timestamp is quickly registered on one or several witness chains.

Auditors that want to check on a piece of data may query the user chain to find the timestamp it first appeared in, and can then follow it along the hierarchy of witness chains up to a “known-good” chain - that is, a chain they’ve been observing. The essence and the beauty of the system is that, by guarding just their one "known-good" chain, each auditor continuously provides observer services for the entire network.