Monitoring : Part 1

While characterizing the performance of Erlang with a distributed problem (see Parallelization), it became obvious that I also needed to better understand how to monitor applications in my own environment. I typically develop in a corporate environment that already has monitoring and graphing solutions set up, so this is usually just a matter of defining application-specific metrics and then instrumenting the application(s) under development. Right now, however, I am developing applications without that infrastructure in place, and setting it all up turns out to be really interesting.

Background

Monitor

The first thing I tackled was creating a docker-compose-based monitoring and reporting stack:

monitor

‘Monitor’ can be plugged into any build system (Makefile, etc.) and turns on monitoring on my dev box as part of a test run. It collects metrics from both collectd and Prometheus’ node-exporter. ‘Monitor’ also includes some basic Grafana dashboards, so all you have to do is fire up a browser and start looking at the collected metrics in a standard, repeatable way. However, here be (tiny) dragons.

Questions

Plugging in a monitoring stack immediately raised a bunch of questions:

  • Why are there so many different collectors and stores out there? How are they different?
  • What are some basic dashboards that would be useful to analyze applications written in Java, Python, Go, and Erlang?
  • How can I make application monitoring a standard part of my workflow, with as little friction as running unit tests?

As I started using Grafana and looking at the metrics surfaced by the collectd and Prometheus collectors, I couldn’t help but notice that the CPU and memory graphs looked different, and the units didn’t make sense either. Basically, I was running into my own lack of understanding of how the metrics were being collected, what they meant, and how to surface them in a meaningful way.
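One plausible contributor to this kind of mismatch, offered here as an assumption rather than a diagnosis, is that collectors can report the same resource in different shapes. For example, node-exporter exposes CPU time as a cumulative counter of seconds per mode (`node_cpu_seconds_total`), which only becomes a percentage after taking a rate between two scrapes. The Python sketch below shows that conversion for the idle counter. The `CpuSample` structure, the function name, and the sample numbers are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    """One scrape of a cumulative CPU counter (hypothetical structure)."""
    timestamp: float     # wall-clock time of the scrape, in seconds
    idle_seconds: float  # cumulative idle CPU seconds, summed over all cores


def cpu_busy_percent(prev: CpuSample, curr: CpuSample, cores: int) -> float:
    """Convert two cumulative idle-counter samples into a busy percentage."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0 or cores <= 0:
        raise ValueError("samples must be in time order and cores must be positive")
    # Rate of the counter: average number of idle cores over the interval.
    idle_rate = (curr.idle_seconds - prev.idle_seconds) / elapsed
    busy_fraction = 1.0 - idle_rate / cores
    # Clamp to [0, 100] to absorb small timing jitter between scrapes.
    return max(0.0, min(100.0, busy_fraction * 100.0))


if __name__ == "__main__":
    # Made-up numbers: 4 cores, scrapes 15 s apart, 45 idle core-seconds accrued.
    before = CpuSample(timestamp=1000.0, idle_seconds=5000.0)
    after = CpuSample(timestamp=1015.0, idle_seconds=5045.0)
    print(f"{cpu_busy_percent(before, after, cores=4):.1f}% busy")  # prints "25.0% busy"
```

Graphing the raw counter would show an ever-growing line in seconds, while graphing the derived value shows a percentage. Two dashboards built on those two views of the same machine would look nothing alike.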

It’s time to deep dive into some code and docs and try to sort this out.

All that and more coming up next in Monitoring : Part 2.