Monitoring : Part 1
While characterizing the performance of Erlang on a distributed problem (see Parallelization), it became obvious that I also needed to better understand how to monitor applications in my own environment. I typically develop in a corporate environment that already has monitoring and graphing solutions set up… so monitoring is usually just a matter of defining application-specific metrics and then instrumenting the application(s) under development. Right now, however, I am developing without that infrastructure in place, and setting it all up myself has turned out to be really interesting.
Background
Monitor
So, the first thing I tackled was creating a docker-compose-based monitoring and reporting stack:
‘Monitor’ can be plugged into any build system (Makefile, etc.) and turns on monitoring on my dev box as part of a test run. It collects metrics from both collectd and from prometheus' node-exporter. ‘Monitor’ includes some basic grafana dashboards so that all you have to do is fire up a browser and start looking at the collected metrics in a standard and repeatable way. However, here be (tiny) dragons.
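To make the moving parts concrete, here is a minimal sketch of what a compose file for such a stack might look like. To be clear, this is not the actual ‘Monitor’ config: the image names and ports are the upstream defaults, and the collectd-exporter wiring is my assumption about one way collectd metrics could reach prometheus.

```yaml
# docker-compose.yml -- illustrative sketch, not the actual 'Monitor' config.
version: "3"
services:
  node-exporter:              # host-level metrics (CPU, memory, disk, ...)
    image: prom/node-exporter
    # (host /proc and /sys mounts omitted for brevity)
    ports:
      - "9100:9100"
  collectd-exporter:          # bridges collectd's network protocol to Prometheus
    image: prom/collectd-exporter
    command:
      - "--collectd.listen-address=:25826"  # accept collectd 'network' plugin traffic
    ports:
      - "9103:9103"           # Prometheus scrape endpoint
      - "25826:25826/udp"     # collectd sends metrics here
  prometheus:                 # scrapes and stores the metrics
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:                    # dashboards on http://localhost:3000
    image: grafana/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```

With something like this in place, wiring it into a build system is just a Makefile target that runs `docker-compose up -d` before a test run and `docker-compose down` afterwards.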
Questions
Plugging in a monitoring stack immediately raised a bunch of questions:
- Why are there so many different collectors and stores out there? How are they different?
- What are some basic dashboards that would be useful to analyze applications written in Java, Python, Go, and Erlang?
- How can I make application monitoring just a standard part of my workflow, with low friction, much like running unit tests?
As I started using grafana and looking at the metrics surfaced by the collectd and prometheus collectors, I couldn’t help but notice that the CPU and memory graphs from the two sources looked different, and the units didn’t make sense either. Basically, I was running into my complete lack of understanding of how metrics are collected, what they mean, and how to surface them in a meaningful way.
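As a concrete hint at where the two sets of graphs come from, here is a sketch of how prometheus might be pointed at both collectors in the compose sketch above. Again, this is an assumption on my part: the job names and targets are illustrative, and the real stack may route collectd metrics differently. It also hints at one source of the unit confusion: node-exporter reports CPU time in seconds, while collectd’s cpu plugin counts jiffies by default.

```yaml
# prometheus.yml -- hypothetical scrape config matching the compose sketch above
scrape_configs:
  - job_name: node
    static_configs:
      # node-exporter exposes counters like node_cpu_seconds_total (in seconds)
      - targets: ["node-exporter:9100"]
  - job_name: collectd
    static_configs:
      # collectd's cpu plugin counts jiffies by default, not seconds
      - targets: ["collectd-exporter:9103"]
```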
It’s time for a deep dive into some code and docs to sort this out.
All that and more coming up next in Monitoring : Part 2.