Alerting
Alerting is the final layer of your observability stack, and it only works well once you have monitoring and logging in place. You need monitoring to alert on increased resource usage, and you need logging to alert on specific errors.
What to Alert On
From Monitoring
Your monitoring data feeds alerts for the most common operational signals:
- Response times -- Alert when page or API response times exceed your acceptable threshold.
- Error rates -- Alert when the proportion of failed requests rises above a baseline.
- Resource usage -- Alert when CPU, memory, or disk consumption approaches capacity.
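The three signals above can be sketched as simple threshold checks. This is a minimal illustration, not the behaviour of any particular monitoring product: the `Snapshot` fields, the `THRESHOLDS` values, and the `evaluate` function are all hypothetical names and numbers chosen for the example, and you would tune the thresholds to your own baseline.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One hypothetical reading taken from your monitoring system."""
    p95_response_ms: float  # 95th-percentile response time in milliseconds
    total_requests: int
    failed_requests: int
    cpu_percent: float
    memory_percent: float
    disk_percent: float


# Illustrative thresholds only (assumptions, not recommendations).
THRESHOLDS = {
    "p95_response_ms": 800.0,
    "error_rate": 0.02,  # 2% of requests failing
    "cpu_percent": 85.0,
    "memory_percent": 90.0,
    "disk_percent": 80.0,
}


def evaluate(snapshot: Snapshot, thresholds: dict = THRESHOLDS) -> list:
    """Return the name of every signal that exceeds its threshold."""
    # Treat a period with no traffic as a zero error rate
    # rather than dividing by zero.
    if snapshot.total_requests:
        error_rate = snapshot.failed_requests / snapshot.total_requests
    else:
        error_rate = 0.0

    observed = {
        "p95_response_ms": snapshot.p95_response_ms,
        "error_rate": error_rate,
        "cpu_percent": snapshot.cpu_percent,
        "memory_percent": snapshot.memory_percent,
        "disk_percent": snapshot.disk_percent,
    }
    return [name for name, value in observed.items() if value > thresholds[name]]
```

Calling `evaluate` on each new snapshot gives you the list of signals that should raise an alert. In practice your monitoring tool evaluates rules like these for you, but the underlying comparison is the same.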
From Logging
You can derive metrics from your logs and alert on those as well. For example, you might log every failed payment attempt, create a log-based metric that counts those entries, and attach an alert that fires when the failure count exceeds a certain threshold within a given time window. This kind of alert is particularly useful for business-critical flows that your standard monitoring dashboards may not cover.
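The failed-payment example can be sketched as a sliding-window counter. This is an illustration under stated assumptions rather than how any specific logging platform implements log-based metrics: the `FailureWindow` class, the `payment_failed` log marker, and the threshold and window values are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


def is_payment_failure(log_line: str) -> bool:
    """Hypothetical matcher: assumes failures are logged with this marker."""
    return "payment_failed" in log_line


class FailureWindow:
    """Counts failures inside a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one failure at `timestamp` (in seconds).

        Returns True when the number of failures inside the window
        exceeds the threshold, i.e. when an alert should fire.
        """
        self._timestamps.append(timestamp)
        # Drop failures that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


def should_alert(entries, window: FailureWindow) -> bool:
    """Feed (timestamp, log_line) pairs through the window.

    Returns True if any matching entry pushed the count over the threshold.
    """
    fired = False
    for timestamp, line in entries:
        if is_payment_failure(line) and window.record(timestamp):
            fired = True
    return fired
```

For example, `FailureWindow(threshold=3, window_seconds=60)` stays quiet for the first three failures in a minute and signals on the fourth.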
Alerting Tools
Several well-established tools handle alerting at scale:
- PagerDuty -- Industry-standard incident management and on-call routing.
- Opsgenie -- Alert consolidation and escalation (now part of Atlassian).
- VictorOps -- On-call collaboration and incident response (acquired by Splunk and now offered as Splunk On-Call).
If you are running on a cloud provider, you can also use its native alerting services -- Amazon CloudWatch alarms, Azure Monitor alerts, or Google Cloud Monitoring alerting policies -- to keep your stack tightly integrated.
On-Call Alerting
If your application runs around the clock, you need an on-call rota, and alerting is what makes that rota workable. A tool like PagerDuty can route alerts to the on-call engineer via SMS, email, or phone call, so that whoever is on call hears about a problem as soon as it happens.
This is a must for any e-commerce site or 24-hour service. That said, it only works if the underlying logging and monitoring are set up correctly -- noisy or misleading alerts quickly erode trust in the system, and engineers who stop trusting their pages start ignoring them.
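To make the routing idea concrete, here is a minimal sketch of a weekly rota lookup combined with severity-based channels. It does not use or imitate the PagerDuty API; the engineer names, the rota start date, the one-week shift length, and the `CHANNELS_BY_SEVERITY` mapping are all assumptions made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rota: each engineer is on call for one week at a time.
ROTA = ["alice", "bob", "carol"]
ROTA_START = datetime(2024, 1, 1, tzinfo=timezone.utc)  # a Monday
SHIFT_LENGTH = timedelta(weeks=1)

# Hypothetical policy: more severe alerts use more intrusive channels.
CHANNELS_BY_SEVERITY = {
    "low": ["email"],
    "high": ["sms", "email"],
    "critical": ["phone", "sms", "email"],
}


def on_call_engineer(now: datetime) -> str:
    """Return whoever is on call at `now` (a timezone-aware datetime)."""
    # Dividing one timedelta by another with // gives a whole number of shifts.
    shifts_elapsed = (now - ROTA_START) // SHIFT_LENGTH
    return ROTA[shifts_elapsed % len(ROTA)]


def route(severity: str, now: datetime) -> dict:
    """Decide who gets notified and through which channels."""
    return {
        "engineer": on_call_engineer(now),
        "channels": CHANNELS_BY_SEVERITY[severity],
    }
```

A dedicated on-call tool adds the parts this sketch leaves out, such as acknowledgement tracking, escalation to a second engineer, and schedule overrides.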