Jun 25, 2019

Monitoring : Part 1

While characterizing the performance of Erlang with a distributed problem (see Parallelization), it became obvious that I also needed to better understand how to monitor applications in my own environment. I typically develop in a corporate environment that already has monitoring and graphing solutions set up, so this is usually just a matter of defining application-specific metrics and then instrumenting the application(s) under development. Right now, however, I am developing applications without that infrastructure in place, and setting all of this up turns out to be really interesting.

Background

Monitor

So, the first thing I tackled was creating a docker-compose-based monitoring and reporting stack:

monitor

‘Monitor’ can be plugged into any build system (Makefile, etc.) and turns on monitoring on my dev box as part of a test run. It collects metrics from both collectd and Prometheus' node-exporter. ‘Monitor’ also includes some basic Grafana dashboards, so all you have to do is fire up a browser and start looking at the collected metrics in a standard and repeatable way. However, here be (tiny) dragons.
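To make the "plugged into any build system" idea concrete, here is a minimal sketch of a wrapper that brings the stack up before a test run. It is not the actual ‘Monitor’ code: the compose file path `monitor/docker-compose.yml`, the function name `run_with_monitoring`, and the `teardown` flag are all hypothetical, and it assumes the `docker-compose` CLI is on the PATH.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command (list of strings) and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command with the real shell tools and return its exit code."""
    return subprocess.call(list(cmd))


def run_with_monitoring(
    test_cmd: Sequence[str],
    compose_file: str = "monitor/docker-compose.yml",  # hypothetical path
    teardown: bool = False,
    runner: Runner = _run,
) -> int:
    """Start the monitoring stack, run the tests, and return the test exit code.

    The stack is left running by default so the dashboards can still be
    inspected in a browser after the test run finishes.
    """
    compose = ["docker-compose", "-f", compose_file]

    # Start the stack in the background; give up early if that fails.
    up_status = runner(compose + ["up", "-d"])
    if up_status != 0:
        return up_status

    try:
        return runner(list(test_cmd))
    finally:
        # Optional cleanup, e.g. for CI-style runs.
        if teardown:
            runner(compose + ["down"])


if __name__ == "__main__":
    # Example: wrap a Makefile test target (assumed to exist).
    raise SystemExit(run_with_monitoring(["make", "test"]))
```

A Makefile target could call this script in place of the bare test command, which keeps the monitoring step out of the way of the tests themselves.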

Questions

Plugging in a monitoring stack immediately raised a bunch of questions:

Why are there so many different collectors and stores out there? How are they different?
What are some basic dashboards that would be useful to analyze applications written in Java, Python, Go, and Erlang?
How can I make application monitoring just a standard part of my workflow, with low friction, much like running unit tests?

As I started using Grafana and looking at the metrics being surfaced via the collectd and Prometheus collectors, I couldn’t help but notice that the CPU and memory graphs looked different, and the units didn’t make sense, either. Basically, what I was running into was my complete lack of understanding of how the metrics were being collected, what they meant, and how to surface them in a meaningful way.
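One plausible reason for graphs like these to disagree (an assumption on my part at this point, not a diagnosis) is that some collectors report cumulative counters rather than ready-made percentages. node-exporter, for example, exposes CPU time as an ever-growing count of seconds per mode, so a raw graph of it looks nothing like a utilization percentage. A minimal sketch of the conversion, using a hypothetical helper name and made-up sample values:

```python
def cpu_busy_percent(
    idle_start: float,
    idle_end: float,
    elapsed_seconds: float,
    cores: int = 1,
) -> float:
    """Convert two samples of a cumulative idle-seconds counter into busy %.

    idle_start / idle_end are the counter values (summed over all cores)
    at the beginning and end of the window; elapsed_seconds is the
    wall-clock length of the window.
    """
    if elapsed_seconds <= 0 or cores <= 0:
        raise ValueError("elapsed_seconds and cores must be positive")
    if idle_end < idle_start:
        # The counter went backwards, e.g. after a reboot.
        raise ValueError("counter reset detected")

    total = elapsed_seconds * cores          # CPU-seconds available in the window
    busy = total - (idle_end - idle_start)   # CPU-seconds not spent idle
    percent = 100.0 * busy / total

    # Sampling jitter can push the value slightly outside 0..100.
    return max(0.0, min(100.0, percent))


if __name__ == "__main__":
    # Made-up samples: 12 idle seconds accumulated over a 15 second window.
    print(cpu_busy_percent(1000.0, 1012.0, 15.0))
```

With the made-up samples above, one core that was idle for 12 of 15 seconds comes out at roughly 20% busy. A dashboard that plots the raw counter and one that plots this derived percentage will look completely different even though they describe the same machine.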

It’s time to deep dive into some code and docs and try to sort this out.

All that and more coming up next in Monitoring : Part 2.