3 October 2015

Microservices Monitoring

Breaking down a system into more granular services guided by the single responsibility principle does have multiple benefits of bounded context. However, it can also add a degree of complexity that requires more extensive monitoring. With multiple services interaction in a distributed systems context implies multiple log files and a need to aggregate them as well as multiple places for network latency issues to arise. One simple approach is to monitor everything in the entire workflow of the services as well as the system as whole but at same time try to get the bigger picture through an aggregation process. Also, add structure to the logs by utilizing correlation IDs which can then provide a guided trail. The need to be responsive can also be important so real time alerting may also be needed in order to avoid cascaded issues. One can abstract away the service from the system for a monitoring strategy.  The current trend towards monitoring is in a holistic way to get the full picture of the entire system including all its sub-systems as well as all the services interaction within it. A break down of the types of things that can be monitored and examples of tools is given below.

Service-Level Tracking:
  • check inbound response times, error rates, and application metrics
  • check downstream response health, response times of calls, error rates (Hystrix)
  • standardize metrics collection process and pipelines
  • standardize on logging formats so aggregation is easier
  • check system processes for the OS in order to plan for capacity

System-Level Tracking:
  • check host metrics like CPU
  • check system logs and aggregate them so it is possible to filter on individual hosts
  • standardize on single query option for searching through logs
  • standardize on correlation IDs
  • standardize on an action plan and alert levels
  • unify aggregation (Riemann or Suro)

Logstash and Graphite/Collectd/Statsd are also often used in conjunction for the collection and aggregation of logs. One can also apply the ELK stack. The Java Metrics Library can also be utilized to get insights of code during production. There are other tool options available like Skyline and Oculus for anomaly detection and correlation.