Setting up the objectives
We need to define the scope of what we want to accomplish through instrumentation. We'll keep it small by limiting ourselves to a single goal. We'll scale services if their response times are over an upper limit and de-scale them if they're below a lower limit. Any other alert will lead to a notification to Slack. That does not mean that Slack notifications should exist forever. Instead, they should be treated as a temporary solution until we find a way to translate manual corrective actions into automated responses performed by the system.
A good example of alerts that are often treated manually are responses with errors (status codes 500 and above). We'll send alerts whenever they reach a threshold over a specified period. They will result in Slack notifications that will become pending tasks for humans. An internal rule should be to fix the problem first, evaluate why it happened, and write a script that will repeat the same set of steps. With such a script, we...