Mapbox team spotlight: Platform
By: Devin Boyer
All engineering teams at Mapbox share a few common principles. Two of them are that teams are fully responsible for monitoring the health and reliability of their own systems, and that they use declarative infrastructure to define how software is built and deployed. The Mapbox Platform team, which builds developer tools and supports shared cloud infrastructure for Mapbox developers, recently codified these principles in a tool called on-call. It helps teams receive PagerDuty alarms when things go wrong and provides a handful of Slack integrations for looking up information about a team’s on-call configuration.
Challenge: Managing thousands of alarms
As an organization which runs entirely on the AWS Cloud, we primarily use AWS CloudWatch Alarms to tell us when systems are not behaving correctly. As teams create and update services, they use CloudFormation to create CloudWatch alarms alongside the other resources necessary to run their applications. Across all Mapbox services, we have nearly 3,000 unique alarms deployed to monitor our production systems and alert teams to failures. Service teams use PagerDuty to manage an on-call rotation and route alarms to the person currently carrying the “pager”.
While it’s relatively straightforward to create an AWS SNS topic with an email subscription that triggers alarms on a PagerDuty service, users are required to manually click a “Confirm Subscription” link in the PagerDuty interface after creating the subscription. Failure to do so could result, and has resulted, in missed pages. Because Mapbox application stacks tend to be self-contained, every new piece of backend software we deployed meant a new SNS topic and a new subscription to confirm.
Solution: on-call PagerDuty Integration
We built on-call to eliminate this burden. on-call acts as glue between the various interfaces to our incident response tools, giving teams a common way to route alarms to their configured PagerDuty service.
PagerDuty offers a native integration with CloudWatch which eliminates the need to manually confirm SNS topic subscriptions. The integration provides a webhook URL which is added as a topic subscription to receive alarms. So that teams don’t have to manually copy and paste these webhooks into their various templates or stack configurations, we designed on-call to create a single SNS topic for each team, subscribed to that team’s specified PagerDuty service. A team creates its topic by adding its PagerDuty service ID to the on-call application configuration. The Platform team then runs a deploy in every AWS region where we operate, including AWS China, which creates the configured SNS topic for the team. The topic’s ARN (Amazon Resource Name) is then exposed as a CloudFormation export value for use in application stacks. (Read more in the AWS documentation.)
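As a rough illustration of the per-team topic described above, the following sketch builds a CloudFormation template as a plain JavaScript object: one SNS topic with an HTTPS subscription to a PagerDuty webhook, and an output that exports the topic ARN. The function name `buildTeamTopicTemplate`, the `on-call-<team>` export naming scheme, and the placeholder webhook URL are assumptions made for this example, not details of the actual on-call implementation.

```javascript
'use strict';

// Build a per-team template: one SNS topic subscribed to a PagerDuty
// webhook, with the topic ARN exported for other stacks to import.
// The export naming scheme (on-call-<team>) is a hypothetical choice.
function buildTeamTopicTemplate(team, webhookUrl) {
  return {
    AWSTemplateFormatVersion: '2010-09-09',
    Resources: {
      Topic: {
        Type: 'AWS::SNS::Topic',
        Properties: {
          // HTTPS subscription pointing at the PagerDuty integration webhook.
          Subscription: [{ Protocol: 'https', Endpoint: webhookUrl }]
        }
      }
    },
    Outputs: {
      TopicArn: {
        // Ref on an AWS::SNS::Topic resolves to the topic ARN.
        Value: { Ref: 'Topic' },
        Export: { Name: `on-call-${team}` }
      }
    }
  };
}

// Placeholder webhook URL; a real one would come from PagerDuty.
const teamTemplate = buildTeamTopicTemplate(
  'platform',
  'https://example.invalid/pagerduty-webhook'
);
console.log(JSON.stringify(teamTemplate.Outputs, null, 2));
```

Because the export lives in one shared stack per region, application stacks only need to know the export name, never the webhook URL itself.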
For example, the Platform team can subscribe an alarm to their PagerDuty configuration by specifying AlarmActions as follows:
This snippet makes use of the open-source Mapbox node module cloudfriend, which enables developers to write composable CloudFormation templates using JavaScript.
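The original snippet is not reproduced in this text. As a stand-in, here is a minimal sketch of what such an alarm definition could look like, written as a plain JavaScript object that uses the raw `Fn::ImportValue` intrinsic rather than cloudfriend helpers. The export name `on-call-platform` and the metric settings are assumptions chosen for illustration only.

```javascript
'use strict';

// Hypothetical export name; the real name published by on-call
// is not given in the article.
const ONCALL_TOPIC_EXPORT = 'on-call-platform';

// Build a CloudWatch alarm resource whose AlarmActions point at the
// team's shared SNS topic, imported from the on-call stack's export.
function buildAlarm(exportName) {
  return {
    Type: 'AWS::CloudWatch::Alarm',
    Properties: {
      AlarmDescription: 'Lambda function is reporting errors',
      Namespace: 'AWS/Lambda',
      MetricName: 'Errors',
      Statistic: 'Sum',
      Period: 60,
      EvaluationPeriods: 1,
      Threshold: 0,
      ComparisonOperator: 'GreaterThanThreshold',
      // Route the alarm to the team's PagerDuty-subscribed topic.
      AlarmActions: [{ 'Fn::ImportValue': exportName }]
    }
  };
}

const template = {
  AWSTemplateFormatVersion: '2010-09-09',
  Resources: { ErrorAlarm: buildAlarm(ONCALL_TOPIC_EXPORT) }
};

console.log(JSON.stringify(template, null, 2));
```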
We wire all of this together using a CloudFormation Custom Resource. Custom Resources are a handy way of extending CloudFormation, either to fill gaps where AWS has not yet provided a feature or, as we’ve done here, to build integrations with other services using Lambda functions. When a new team adds their configuration to on-call, the custom resource Lambda function uses the PagerDuty API to create a properly configured CloudWatch integration if one does not already exist. It also sets a few configuration overrides we’ve found useful, like setting incident titles to the actual CloudWatch alarm name. The function then returns the webhook URL as an attribute which can be referenced by other parts of the template; in our case, it is used as the value for the SNS topic subscription. As the organization grows, we’ve seen how this CloudWatch to PagerDuty integration makes it very easy for new teams to get set up to provide an operationally excellent service.
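The custom resource flow can be sketched roughly as below. This is not the on-call source code: `ensureIntegration` is a hypothetical injected helper standing in for the PagerDuty API calls, and the `ServiceId` property and `WebhookUrl` attribute names are assumptions. What the sketch does follow is the standard shape of a CloudFormation custom resource response, where values placed under `Data` become attributes readable with `Fn::GetAtt`.

```javascript
'use strict';

// Build the response body CloudFormation expects from a custom resource.
// Keys under Data become attributes available through Fn::GetAtt.
function buildResponse(event, status, data, reason) {
  return {
    Status: status,
    Reason: reason || '',
    PhysicalResourceId:
      event.PhysicalResourceId ||
      `pagerduty-integration-${event.ResourceProperties.ServiceId}`,
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId,
    Data: data || {}
  };
}

// ensureIntegration(serviceId) is a hypothetical helper that would create
// the CloudWatch integration if missing and resolve to its webhook URL.
async function handle(event, ensureIntegration) {
  // Assumption: on delete, this sketch leaves the integration in place.
  if (event.RequestType === 'Delete') return buildResponse(event, 'SUCCESS');
  try {
    const webhookUrl = await ensureIntegration(event.ResourceProperties.ServiceId);
    return buildResponse(event, 'SUCCESS', { WebhookUrl: webhookUrl });
  } catch (err) {
    return buildResponse(event, 'FAILED', null, err.message);
  }
}

// Example invocation with a stubbed helper. A real Lambda handler would
// send the returned object as JSON in an HTTP PUT to event.ResponseURL.
const sampleEvent = {
  RequestType: 'Create',
  StackId: 'example-stack-id',
  RequestId: 'example-request-id',
  LogicalResourceId: 'PagerDutyIntegration',
  ResourceProperties: { ServiceId: 'PABC123' }
};

handle(sampleEvent, async () => 'https://example.invalid/pagerduty-webhook')
  .then((response) => console.log(response.Status, response.Data));
```

Injecting the helper keeps the response-building logic testable without network access, which is a design choice of this sketch rather than something the article describes.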
Slack commands
We also built three custom Slack commands as part of the on-call project. The first is an automated bot which will post who is going on- and off-call whenever a shift change occurs. This provides a nice visual reminder of who the on-call is for a given team, including when any scheduled overrides are taking place.
To identify who is presently on-call for a given team, anyone in our organization can use the @mapbox on-call Slack command. We namespace several internal Mapbox tools under the @mapbox command. Similar to the output from the handoff message, the on-call subcommand will list who is currently on-call for all levels of a PagerDuty escalation policy where a rotation (more than one person) is set up.
The final notable Slack tool that makes up the on-call suite of tools is the @mapbox alarm command. This command is used to trigger an alarm on any specified PagerDuty service. This can be used to alert service teams of a critical situation which was not or could not be caught via a standard metric alarm.
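The article does not say how the alarm command talks to PagerDuty. One plausible approach, sketched here as an assumption, is to send a trigger event to the PagerDuty Events API v2 using an integration routing key for the target service. The function names, the default `source` of `slack`, and the fixed `critical` severity are illustrative choices; `sendEvent` relies on the global `fetch` available in Node 18 and later and is defined but not called.

```javascript
'use strict';

const EVENTS_URL = 'https://events.pagerduty.com/v2/enqueue';

// Build an Events API v2 "trigger" event for a given integration routing key.
// How a Slack command maps a service name to a routing key is not covered here.
function buildTriggerEvent(routingKey, summary, source) {
  if (!routingKey || !summary) {
    throw new Error('routingKey and summary are required');
  }
  return {
    routing_key: routingKey,
    event_action: 'trigger',
    payload: {
      summary,
      source: source || 'slack',
      severity: 'critical'
    }
  };
}

// POST the event to PagerDuty and resolve to the HTTP status code.
// Not invoked in this sketch, so no network request is made.
async function sendEvent(event) {
  const res = await fetch(EVENTS_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(event)
  });
  return res.status;
}

const exampleEvent = buildTriggerEvent(
  'EXAMPLE_ROUTING_KEY',
  'Customer reports tiles failing to load'
);
console.log(JSON.stringify(exampleEvent, null, 2));
```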
Conclusion
on-call provides a nicely scoped set of incident response tools that are widely used across our organization. The CloudFormation functionality makes it easy for teams to set up and receive properly configured alarms, while the Slack commands provide common patterns for triggering an incident when necessary or identifying who is on-call for a particular team or responsibility. In fact, some teams at Mapbox have used this tooling to create schedules in PagerDuty which reflect other rotating responsibilities, like issue queue gardening.
Our on-call system is fairly mature and requires very little ongoing maintenance. One adjacent area of future work we’re considering is automatic deployment of new team additions in all AWS regions as the company grows.
Do you like building internal tools like this to help developers ship more effectively? 🚀 Do the acronyms CI/CD bring joy to your containerized heart? The Mapbox Platform team is hiring for multiple roles, and we’d love for you to join us! Apply here.
Building on-call: Mapbox’s managed incident response tool was originally published in Points of interest on Medium.