Keeping state in mtail programs

The program is run on each log line from start to finish, with no loops. The only state emitted by the program is the content of the exported metrics. Metrics can be read by the program, though, so exported metrics are the place to keep state between lines of input.

It’s often the case that a log line is printed by an application at the start of some session-like interaction, and another at the end. Often these sessions have a session identifier, and every intermediate event in the same session is tagged with that identifier. Using map-valued exported metrics is the way to store session information keyed by session identifier.

The example program rsyncd.mtail shows how to use a session tracking metric for measuring the total user session time.

counter connection_time_total
hidden gauge connection_time by pid

/connect from \S+/ {
  connection_time[$pid] = timestamp()

  del connection_time[$pid] after 72h

/sent .* bytes  received .* bytes  total size \d+/ {
  connection_time_total += timestamp() - connection_time[$pid]
  del connection_time[$pid]

rsyncd uses a child process for each session so the pid field of the log format contains the session identifier in this example.

hidden metrics

A hidden metric is only visible to the mtail program, it is hidden from export. Internal state can be kept out of the metric collection system to avoid unnecessary memory and network costs.

Hidden metrics are declared by prepending the word hidden to the declaration:

hidden gauge connection_time by pid

Removing session information at the end of the session

The maps can grow unbounded with a key for every session identifier created as the logs are read. If you see mtail consuming a lot of memory, it is likely that there’s one or more of these maps consuming memory.

(You can remove the hidden keyword from the declaration, and let mtail reload the program without restarting and the contents of the session information metric will appear on the exported metrics page. Be warned, that if it’s very big, even loading this page may take a long time and cause mtail to crash.)

mtail can’t know when a map value is ready to be garbage collected, so you need to tell it. One way is to defer deletion of the key and its value if it is not updated for some duration of time. The other way is to immediately delete it when the key and value are no longer needed.

   del connection_time[$pid] after 72h

Upon creation of a connection time entry, the rsyncd.mtail program instructs mtail to remove it 72 hours after it’s no longer updated. This means that the programmer expects, in this case, that sessions typically do not last longer than 72 hours because mtail does not track the timestamps for all accesses of metrics, only writes to them.

   del connection_time[$pid]

The other form indicates that when the session is closed, the key and value can be removed. The caveat here is that logs can be lossy due to problems with the application restarting, mtail restarting, or the log delivery system (e.g. syslog) losing the messages too. Thus it is recommended to use both forms in programs.

  1. del ... after form when the metric is created, giving it an expiration time longer than the expected lifespan of the session.
  2. del form when the session is ended, explicitly removing it before the expiration time is up.

It is not an error to delete a nonexistent key from a map.

Expiry is only processed once ever hour, so durations shorter than 1h won’t take effect until the next hour has passed.