
Transparent Logging: A Guide

Introduction

The Trillian project generalizes the ideas behind Certificate Transparency to allow transparent, append-only logging of arbitrary data.

This document works through some of the high-level design decisions involved in creating a transparent Log; it’s a good idea to have a clear understanding of these decisions before writing any code.

We assume the reader has a rough idea of the concepts involved in transparent Logs: Merkle trees, inclusion and consistency proofs, signed tree heads, and so on.

Running Examples

Throughout this document, we will use three example scenarios for the use of a Log, to illustrate the considerations involved:

Two of these examples will turn out to be a good match for transparent Logging; the other, not so much.

Ecosystem

The first fundamental question for designing a transparent Log is: Why are you logging?

Understanding what the Log is intended to achieve allows you to check whether the transparent, append-only characteristics of the Log actually achieve those goals. It also helps you to understand how much of a wider ecosystem – auditors, monitors, indexes – also needs to be built up around the Log.

Log Contents

The second fundamental question for designing a transparent Log is: What are you logging?

Ideally, you will have a single sentence that answers this question, and answers it sufficiently clearly that there is little ambiguity left in the implementation – the definition of the leaf contents for the Log should be mostly covered by this sentence.

For example:

Reality soon intrudes and dilutes the clarity of these statements, but they remain useful as a guiding principle for what is and is not allowed as a variation on the log contents.

Leaf Hashing

A Trillian-based transparent Log deals with two distinct leaf hashes, which need to be defined for each Log application.

The first hash for a leaf in a Log is the Merkle hash. This is the hash value that percolates up the Merkle tree and is therefore incorporated into the root hash for the Log. The cryptographic guarantees of the Log’s Merkle tree apply only to data included in the Merkle hash.

The default Merkle hash for a Trillian Log leaf is SHA-256(0x00 | leaf.LeafValue) [1].
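As a minimal sketch of the default formula above, the leaf hash can be computed with Python's standard `hashlib`. The name `merkle_leaf_hash` is illustrative only and is not part of Trillian's API; the sketch assumes that `|` in the formula means byte concatenation.

```python
import hashlib

# Single-byte prefix that marks the input as a leaf (assumed to be prepended
# by plain byte concatenation, as the formula suggests).
LEAF_PREFIX = b"\x00"


def merkle_leaf_hash(leaf_value: bytes) -> bytes:
    """Return SHA-256(0x00 | leaf_value) as raw bytes."""
    return hashlib.sha256(LEAF_PREFIX + leaf_value).digest()
```

Because of the prefix, the Merkle hash of a leaf differs from the plain SHA-256 of the same bytes, so a client that verifies proofs has to apply the same prefix.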

A Trillian application may also specify a separate per-leaf Identity Hash; this identifies which leaf values should be considered equivalent, in the sense that an existing leaf with a given (application-provided) identity hash prevents Trillian from accepting any new leaves with the same identity hash.

This feature is primarily designed to allow applications to detect (and squash) semantic duplicates, where two different leaf values actually represent the same underlying object.

One particular example is an application where it is important to record the time at which an item was logged, together with the item itself, and where a later attempt to log an item that is already logged should be rejected. This in turn is related to inclusion promises (see below).
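The effect of an identity hash can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not Trillian's implementation: `ToyLog`, `queue_leaf`, and the `timestamp|payload` leaf layout are all hypothetical, chosen only to show two different leaf values being treated as the same underlying object.

```python
import hashlib


class ToyLog:
    """Hypothetical in-memory log that de-duplicates on an identity hash."""

    def __init__(self, identity_hash):
        self._identity_hash = identity_hash  # application-provided function
        self._seen = set()                   # identity hashes already logged
        self.leaves = []                     # accepted leaf values, in order

    def queue_leaf(self, leaf_value: bytes) -> bool:
        """Append the leaf unless its identity hash is already present."""
        ident = self._identity_hash(leaf_value)
        if ident in self._seen:
            return False  # semantic duplicate: rejected
        self._seen.add(ident)
        self.leaves.append(leaf_value)
        return True


def identity_ignoring_timestamp(leaf_value: bytes) -> bytes:
    """Identity hash over the payload only (assumed layout: timestamp|payload)."""
    _, _, payload = leaf_value.partition(b"|")
    return hashlib.sha256(payload).digest()
```

With this identity function, `b"100|item-A"` and `b"200|item-A"` are different leaf values but share an identity hash, so only the first submission (and therefore the first logging time) is kept.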

Leaf Size / Accessibility

Another consideration for designing a new transparent Log is the size of each leaf; Trillian can accommodate fairly large leaves, but would struggle with storing multi-gigabyte leaf contents [2].

In this situation we can appeal to the fundamental theorem of security engineering:

“We can solve any security problem by introducing an extra level of hashing.”

A large leaf blob can be stored in a separate (non-transparent) data store (e.g. a content-addressable store which allows retrieval of the blobs via their hashes), and the transparent Log carries the cryptographic hashes of the blobs rather than the blobs themselves.

This approach may also be useful if there are situations where full public access to the logged data is not appropriate.
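The extra level of hashing described in this section can be sketched as follows. The class and function names are hypothetical, the blob store is a plain in-memory dictionary, and the Log is stood in for by a Python list of leaf values; a real deployment would use an external store and a Trillian Log.

```python
import hashlib


class ContentAddressableStore:
    """Hypothetical non-transparent blob store keyed by SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes) -> bytes:
        """Store the blob and return the digest used to retrieve it."""
        digest = hashlib.sha256(blob).digest()
        self._blobs[digest] = blob
        return digest

    def get(self, digest: bytes) -> bytes:
        """Fetch a blob and check that it still matches its digest."""
        blob = self._blobs[digest]
        if hashlib.sha256(blob).digest() != digest:
            raise ValueError("stored blob does not match its digest")
        return blob


def log_large_blob(store, log_leaves, blob: bytes) -> bytes:
    """Put the blob in the store and append only its digest to the log."""
    digest = store.put(blob)
    log_leaves.append(digest)  # the leaf value is the 32-byte digest
    return digest
```

In this arrangement the Log commits to the blob's digest, so anyone who obtains the blob from the store can check it against the logged leaf, while the Log itself never holds the full contents.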

Admission Control

The general idea of transparency can be broken down into two constituent parts, which we’ll call read-transparency and write-transparency:

Allowing write-transparency is vital for encouraging an ecosystem to grow up around the Log(s), and helps to reassure external (read-transparency) users that no “filtering” has been applied to content before it even reaches the log.

However, it’s worth making clear: a write-transparent Log allows arbitrary people on the Internet to write data into your append-only Log – and that data can’t be removed [3] without destroying the Log.

That’s a sufficiently terrifying prospect to motivate having strict admission criteria: checks that submitted content has to pass before being included in the Log. (However, this isn’t specific to write-transparent Logs; even if submissions are restricted to a whitelist of clients, the admission criteria still need to be detailed.)

Examples:

These examples illustrate that good admission criteria typically include two key aspects:

For our examples:

Finally, note that a transparent Log normally acts as an observatory, not as a police officer, in the overall ecosystem. With this in mind, it’s often sensible for the structure checks on submissions to be lax, so that technically-invalid objects that are still signed and distributed can be monitored and attributed.

Inclusion Proofs vs. Promises

Transparent Logs have two distinct mechanisms for guaranteeing inclusion of a particular entry in the Log.

A related factor is the number of entries that are incorporated into each new tree head issued by the Log, which we’ll call the tree head batch size.

The batch size per new tree head may be important for privacy reasons: a small batch size, particularly a batch of size 1, means that a new signed tree head correlates directly with a single submission. This in turn means that a user requesting proofs to/from that tree head, or gossiping that tree head, is likely to be interested in that specific entry in the Log.

So, if privacy is a concern:

A new transparent Log should consider whether it (and the surrounding ecosystem) requires both inclusion promises and proofs, or just the latter, based on assessing the concerns above together with other factors:


1: Prefix included for second-preimage attack resistance.

2: The exact limits depend on the specific storage implementation in use (e.g. ~10MB for CloudSpanner).

3: This isn’t strictly true – the Log could replace a removed leaf with a new leaf type that just holds the Merkle hash value of the full leaf (and define that the hash value for such a leaf is the identity). However, the omission/replacement would be visible to a monitor that retrieved the log contents.