Everyone loves log.

I have a really hard time trying to explain what goes into a useful log message. So, with the aim of getting my own thoughts straight on the topic, let's try and figure it out.

The primary difficulty is that we're balancing contradictory requirements. More concretely, it's seemingly to do with the cardinality of the data we're trying to capture.

In an imagined spectrum of cardinality-one to cardinality-burning-s3-to-the-ground, there are few really solid guidelines to follow - cardinality-one data being most likely pointless to include is maybe one of them. Cardinality-firehose data might seem to be another, but that's more nuanced and seems to be where many people, myself included, trip up.

It's tempting to conflate logs and metrics - certainly, they are related, and logs can generate metrics. Datadog, for example, has this exact feature, allowing us to create on-the-fly aggregate metrics from log events. Scalyr has a similar feature allowing us to graph distributions and so on. All super handy. Which were the most popular countries over the last 24 hours? Can we aggregate P95s over a certain path per window and discard uninteresting log events? Sure thing. There's more, I'm sure, not the least of which is in-process aggregation of events via histograms, counters, gauges, etc.
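To make the in-process aggregation idea a bit more concrete, here's a minimal sketch. It assumes we only care about a count and a P95 per path per window; the `PathLatencies` class and its method names are made up for illustration and don't come from any particular metrics library.

```python
from collections import defaultdict


class PathLatencies:
    """Hypothetical in-process aggregator: one window of latencies per path."""

    def __init__(self):
        self.samples = defaultdict(list)

    def observe(self, path, millis):
        # Record the event in memory instead of writing a log line for it.
        self.samples[path].append(millis)

    def p95(self, path):
        # Nearest-rank percentile: the smallest sample that has at least
        # 95% of the samples at or below it. Integer maths avoids float rounding.
        ordered = sorted(self.samples.get(path, []))
        if not ordered:
            return None
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def flush(self):
        # Summarise the window, then start a fresh one.
        summary = {
            path: {"count": len(values), "p95_ms": self.p95(path)}
            for path, values in self.samples.items()
        }
        self.samples.clear()
        return summary


agg = PathLatencies()
for ms in range(1, 101):
    agg.observe("/checkout", ms)
agg.observe("/health", 3)
window = agg.flush()
```

The uninteresting requests end up as one summary per window rather than one log line each. Keeping every sample in a list is the naive option; a real histogram with fixed buckets would bound the memory, at the cost of some precision.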

However, logs are not metrics. They're there purely and simply to let us know about what the interesting things were. The uninteresting can be viewed in aggregate.

"Everything is fine" is a primary example of cardinality-lots data, but where we kind of don't really care about the detail. That's a straight up contradiction - we want to capture the exact same data as in other situations, but then we ignore it. Not unreasonably, some devs balk at this, and engage the firehose on the off chance that something in the soup might be useful at some point. The problem here is that the data for "stuff is fine" looks a lot like the data for "stuff is totally not fine".

Another detail is that we don't know that we're going to need that context, until something has committed the "cardinal sin" (I know, I know..) and become interesting. How do we tie things together after the fact in such a way we can follow what happened?

Our job here seems to suggest itself. Systems are not composed of individual atomic events - things happen as a consequence of other events, and what we care about is being able to tie together that chain of occurrences with the minimum of fuss. Consistency (== low cardinality data) with a handful of unique datapoints, such as request-id / thread-id, seems to be the balance.
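As a rough sketch of that balance, using only Python's standard `logging` and `contextvars` modules: the message templates stay fixed and low cardinality, and a single request id is stamped onto every line. The logger name, the log format, and the `handle_request` function are assumptions made up for this example.

```python
import contextvars
import io
import logging
import uuid

# The one high-cardinality datapoint, held per request. "-" means "no request".
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copies the current request id onto every record passing through."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def handle_request(logger, order_id):
    # Hypothetical request handler: set the id once, log consistently after.
    token = request_id.set(uuid.uuid4().hex)
    try:
        logger.info("order received order_id=%s", order_id)
        logger.warning("payment retry order_id=%s", order_id)
    finally:
        request_id.reset(token)


stream = io.StringIO()
log = build_logger(stream)
handle_request(log, 42)
lines = stream.getvalue().splitlines()
```

Both lines carry the same `request_id=...` value, so once one of them turns out to be interesting, searching on that id pulls back the rest of the chain.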


Tags: ops logs observability

Copyright © 2021 Dan Peddle