Monitoring concepts
Purpose
Monitoring has several benefits:
- Make it easier to locate and fix problems in the event of an outage (reduced MTTR)
- Identify trends that could lead to an outage and fix them proactively (increased MTBF)
- Understand the resource usage of an application better, which makes it possible to scale resources appropriately and avoid waste
Events vs metrics
Monitoring data can be roughly split into two categories, events and metrics.
Events are typically logs of a single (discrete) occurrence, with information about it embedded in the event. One HTTP request would be an event carrying information about the request latency, status, size, user-agent, etc. Visualizing events typically involves aggregating some specific portion of the event data and possibly cross-referencing it with other fields in the event to find correlations.
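As a rough illustration of aggregating events and cross-referencing fields, the sketch below groups a handful of made-up HTTP request events by one field and averages the latency within each group. The event records, the field names (`path`, `status`, `latency_ms`, `user_agent`) and the helper `mean_latency_by` are all assumptions chosen for this example, not the format of any particular logging system.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical HTTP request events; the field names are illustrative only.
events = [
    {"path": "/login", "status": 200, "latency_ms": 120, "user_agent": "curl"},
    {"path": "/login", "status": 500, "latency_ms": 900, "user_agent": "firefox"},
    {"path": "/home", "status": 200, "latency_ms": 80, "user_agent": "firefox"},
    {"path": "/home", "status": 200, "latency_ms": 100, "user_agent": "curl"},
]


def mean_latency_by(events, field):
    """Group events by one field and average the latency within each group."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}


# Aggregate the same events along two different fields to look for correlations.
by_status = mean_latency_by(events, "status")
by_agent = mean_latency_by(events, "user_agent")

print("mean latency by status:", by_status)
print("mean latency by user agent:", by_agent)
```

Because each event keeps all of its fields, the same data can be sliced along any of them after the fact, which is what makes this kind of correlation possible.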
Metrics (also called time series) measure the continuous state of something. The memory usage of a process is not something that happens; it is something that is, and it can be measured at points in time. For metrics, the challenge is choosing the measurement interval. Measure too often and the data becomes costly to gather, process, and store. Measure too seldom and you risk not having enough data when trying to identify problems or bottlenecks.
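The interval trade-off can be sketched with a simulated signal. The example below assumes a hypothetical process whose memory sits at 100 MB and spikes to 500 MB for ten seconds, then samples it at three different intervals. The function names, the numbers, and the shape of the spike are invented for illustration only.

```python
def memory_usage_mb(second):
    """Hypothetical process memory: 100 MB baseline, 500 MB during seconds 125-134."""
    return 500 if 125 <= second < 135 else 100


def sample(interval_s, duration_s=300):
    """Measure the state every interval_s seconds and return the readings."""
    return [memory_usage_mb(t) for t in range(0, duration_s, interval_s)]


# Compare how much data each interval produces and whether it sees the spike.
for interval_s in (1, 5, 60):
    readings = sample(interval_s)
    print(f"interval={interval_s:>2}s samples={len(readings):>3} peak={max(readings)} MB")
```

In this sketch the 1-second interval stores 300 readings and the 5-second interval stores 60, and both observe the spike. The 60-second interval stores only 5 readings and misses the spike entirely, since no sample falls inside the ten-second window.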