Monitoring systems

Status: Notes

Data models

Splunk’s model for logs, where it keeps the raw event and annotates it with data added by the host or extracted from the event, is the most usable that I’ve seen in production. It works great with structured logging like JSON but also works when you have text logging.
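To make the shape of that model concrete, here is a minimal sketch in Python. It is not Splunk's API or storage format; `LogEvent` and `annotate` are hypothetical names, and the `key=value` extraction for text logs is just one simple assumption about what "extracted from the event" could mean.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LogEvent:
    raw: str                                    # the original event text, never modified
    fields: dict = field(default_factory=dict)  # annotations layered on top of the raw text


def annotate(raw: str, host: str) -> LogEvent:
    """Keep the raw event and attach host-supplied and extracted fields."""
    event = LogEvent(raw=raw, fields={"host": host})
    try:
        parsed = json.loads(raw)
    except ValueError:
        parsed = None
    if isinstance(parsed, dict):
        # Structured logging: every top-level key becomes a field.
        for key, value in parsed.items():
            event.fields.setdefault(key, value)
    else:
        # Plain text: fall back to pulling out key=value tokens.
        for token in raw.split():
            key, sep, value = token.partition("=")
            if sep and key and value:
                event.fields.setdefault(key, value)
    return event
```

The point of the design is that `raw` is always kept, so a bad extraction rule can be fixed later without having lost anything.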

For metrics, record a metric name, dimensions specifying details such as which host it was logged on or which percentile it represents, and a timestamp and value. If possible, attach a unit, either in the dimension name or as part of the system.
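As a rough sketch of that record, assuming Python dataclasses: the `Metric` class and the sample values below are made up for illustration and do not come from any particular metrics system.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    name: str                                       # what is being measured
    value: float                                    # the measurement itself
    timestamp: float                                # seconds since the Unix epoch
    dimensions: dict = field(default_factory=dict)  # host, percentile, and so on
    unit: str = ""                                  # carried by the system, not the name


# Hypothetical sample: 99th percentile request latency as seen on one host.
sample = Metric(
    name="request_latency",
    value=0.182,
    timestamp=time.time(),
    dimensions={"host": "web01", "percentile": "p99"},
    unit="seconds",
)
```

Here the unit is a field of its own. The other option mentioned above would be to fold it into a name, for example `request_latency_seconds`.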

Send your monitoring data over UDP

One of the product managers at work asked why we were sending logging data via UDP and not TCP. After all, TCP is a reliable protocol, right? Why would I send my log entries via an unreliable protocol?

First, how unreliable is UDP? Try pinging a site; I have one I use by default. Quickly trying that, I sent 28 packets and had 0% packet loss. When essentially nothing is lost, TCP’s reliability is irrelevant.

If the reliability is irrelevant, UDP is much simpler. On the client side, there is no session to establish. You send one event per datagram, which is not much of a limitation, since a datagram carries up to about 65 KB of data. On the server, handling large numbers of TCP connections is hard; handling UDP datagrams from many sources is trivial.
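A minimal sketch of what that looks like with Python's standard `socket` module. The JSON encoding, the helper names `send_event` and `receive_event`, and the loopback demo are my assumptions for illustration, not a description of any real logging daemon.

```python
import json
import socket


def send_event(sock: socket.socket, addr: tuple, event: dict) -> None:
    """Send one event as one datagram: no connection, no handshake."""
    sock.sendto(json.dumps(event).encode("utf-8"), addr)


def receive_event(sock: socket.socket) -> tuple:
    """Read one datagram and return (event, sender address)."""
    data, sender = sock.recvfrom(65535)  # large enough for any single datagram
    return json.loads(data.decode("utf-8")), sender


# Demo over loopback. One bound socket hears from any number of clients.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.settimeout(2.0)         # do not block forever if the datagram is lost

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_event(client, server.getsockname(), {"level": "info", "msg": "started"})

event, sender = receive_event(server)
client.close()
server.close()
```

The server never accepts or tracks connections. It reads datagrams off a single socket, whoever sent them.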

That still leaves the problem of packet loss. When does it occur? In the classic case of a bunch of machines logging to another machine on the local network, packet loss implies that something is very wrong with your network. In these cases, another feature of TCP comes into play, one that means it won’t help you: throttling.

When a TCP connection loses a packet, the sender notices the missing acknowledgment and resends the packet. If lots of packets are being lost, TCP assumes that the network is congested and reduces its transfer rate. If your network has enough trouble that UDP is untenable for local logging, TCP throttles itself almost to a halt. You shouldn’t expect any more events to arrive at the logging server via TCP than via UDP.

Network logging to syslog was built in the days when logging to a server on a local network was the ubiquitous case. Its assumptions may not hold anymore. If you’re logging to somewhere far away—smartphones logging to a server, or a local machine logging to a virtual machine on AWS—then you have the regular level of packet loss of the Internet, which varies widely based on your connection and where you’re sending packets. These are the cases TCP’s throttling was designed to handle, where UDP is a poor choice.

However, TCP may still provide more guarantees than you need. It guarantees that data appears to the receiver in the same order it was sent, and that streams of data spanning multiple packets are reassembled properly. In short, you can regard a TCP socket as a two-way channel, ignoring the packets going back and forth underneath. It requires session negotiation and figuring out if a session is still alive. This is a lot to pay for if you don’t need it. When you don’t, the usual route is to implement a thin protocol on top of UDP. John Carmack did this for Quake 3. Plan 9 from Bell Labs uses Reliable UDP for communication between nodes. Many streaming video and audio protocols likewise sit on UDP, as do modern BitTorrent protocols. The design choices between UDP and TCP are endless and depend on the protocol’s intended use.
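To give a feel for how thin such a protocol can be, here is a sketch of one possible building block: a sequence number in front of each payload, plus receiver-side tracking of what never arrived. This is not how Quake 3 or Plan 9 does it; the 4-byte header layout and the names `pack`, `unpack` and `GapTracker` are assumptions of mine.

```python
import struct

HEADER = struct.Struct("!I")  # 4-byte unsigned sequence number, network byte order


def pack(seq: int, payload: bytes) -> bytes:
    """Prefix the payload with its sequence number."""
    return HEADER.pack(seq) + payload


def unpack(datagram: bytes) -> tuple:
    """Split a datagram back into (sequence number, payload)."""
    (seq,) = HEADER.unpack_from(datagram)
    return seq, datagram[HEADER.size:]


class GapTracker:
    """Receiver-side bookkeeping: which sequence numbers never arrived?"""

    def __init__(self) -> None:
        self.highest = -1     # highest sequence number seen so far
        self.missing = set()  # numbers skipped over and not yet seen

    def observe(self, seq: int) -> None:
        if seq > self.highest:
            # Everything between the old high-water mark and seq was skipped.
            self.missing.update(range(self.highest + 1, seq))
            self.highest = seq
        else:
            # A late or resent datagram fills a gap.
            self.missing.discard(seq)
```

What the receiver does with `missing` is the design choice: a logging system might only count the losses, while a protocol that needs delivery would ask the sender to resend them.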