Carbon DMP is a next-generation intent data management platform that enables publishers, agencies, and advertisers to leverage unique first-party data for more profitable outcomes.

The Carbon platform processes millions of events every day and accrues tens of thousands of pounds in monthly infrastructure costs across Azure, Amazon Web Services (AWS), Databricks, and SpotInst. The system comprises around 100 different services, with almost 1,000 resources hosted in Azure.

The responsibilities of the Operations (Ops) team at Carbon can be split roughly into the following categories:

  1. System Monitoring
  2. Cost Monitoring
  3. Infrastructure 
  4. Issue Handling

In Part 1 of our ‘DevOps at Carbon’ series, we covered System Monitoring, and introduced Graphite, Grafana, and Azure Application Insights; check that out here.

In Part 2, we shared how we monitor, manage, and reduce our infrastructure costs; check that out here.

In Part 3, we went into detail on our infrastructure, including some issues we faced when containerising applications, and how we overcame them. Check that out here.

In the final part of the series, we take a look at how we are alerted to issues, how we go about resolving them, and how we improve our systems to ensure issues do not recur.

Issue Handling

Ops at Carbon are responsible for handling any bugs and issues that affect the platform, communicating with clients around their support tickets, ensuring that the platform is operating within any Service Level Agreements (SLAs) that are in place, and organising who is on-call to deal with issues.

Issues at Carbon fall roughly into two main categories: internal issues picked up by the systems we monitor with metrics and Grafana or Azure alerts, and client requests and issues submitted through the Carbon dashboard or via an email to support.

Alerting

All issues, internal or from a client, alert the team via automated Slack messages. For urgent or out-of-hours issues, we use PagerDuty to alert on-call staff so we can resolve issues and respond to clients as soon as possible. If an issue isn’t acknowledged within 10 minutes by the primary on-call user, we alert the second user, and so on. We’re also starting to use Ovvy to organise the PagerDuty on-call schedules easily from within Slack.

Example PagerDuty Slack messages
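
For the curious, triggering an incident like this can be done through PagerDuty's Events API v2. The sketch below is illustrative rather than our exact integration; the routing key is the integration key of a PagerDuty service, and the escalation chain (primary on-call first, then the second user after 10 minutes) is configured in the PagerDuty escalation policy rather than in code.

```python
import requests

# PagerDuty Events API v2 endpoint (public, documented by PagerDuty).
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def trigger_pagerduty(routing_key, summary, source, severity="critical", dedup_key=None):
    """Open (or re-trigger) a PagerDuty incident and return its dedup key."""
    event = {
        "routing_key": routing_key,      # integration key of the PagerDuty service
        "event_action": "trigger",
        "payload": {
            "summary": summary,          # what the on-call user sees first
            "source": source,            # e.g. the service or host that raised it
            "severity": severity,        # one of: critical, error, warning, info
        },
    }
    if dedup_key:
        event["dedup_key"] = dedup_key   # lets repeated alerts collapse into one incident
    response = requests.post(EVENTS_URL, json=event, timeout=10)
    response.raise_for_status()
    return response.json()["dedup_key"]
```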

Resolving

Once an alert is acknowledged, we aim to resolve it as fast as possible. We use a Slack channel throughout the resolution of an issue to coordinate communication within the team and with clients. Currently all major issues are organised in a single channel, but we are experimenting with Monzo's new open-source tool Response to create a new Slack channel for each incident.
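
Response takes care of the channel creation for you, but the underlying Slack calls are straightforward. Here's a minimal sketch using the slack_sdk Python client, assuming a bot token with the channels:manage and chat:write scopes; the naming scheme is illustrative, not the one Response uses.

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id, summary):
    """Create a dedicated channel for an incident and post the summary to it."""
    # Slack channel names must be lowercase, without spaces, and at most 80 chars.
    channel = client.conversations_create(name=f"incident-{incident_id}")["channel"]
    client.chat_postMessage(
        channel=channel["id"],
        text=f":rotating_light: Incident {incident_id}: {summary}",
    )
    return channel["id"]
```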

To reduce the time taken to resolve issues with our services, we document steps to debug common problems in runbooks. The structure of these runbooks is again borrowed from Monzo:

  • Who can help? – Who can I talk to if I have questions about this process? Who can I escalate to?
  • Symptoms – How can I quickly tell that this is what is going on?
  • Pre-checks – What checks can I perform to be 100% sure this is a sensible guide to follow?
  • Resolution – What do I have to do to fix it?
  • Post-checks – How can I be 100% sure that it is solved?
  • Rollback – (Optional) How can I undo my fix?
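
Because the structure is so consistent, scaffolding a new runbook is easy to automate. The helper below is a hypothetical sketch (our runbooks are ordinary documents, and the title shown is made up), but it captures the template:

```python
# The six section headings above, with their prompts left as placeholder comments.
SECTIONS = [
    ("Who can help?", "Who can I talk to? Who can I escalate to?"),
    ("Symptoms", "How can I quickly tell that this is what is going on?"),
    ("Pre-checks", "What checks confirm this is a sensible guide to follow?"),
    ("Resolution", "What do I have to do to fix it?"),
    ("Post-checks", "How can I be 100% sure that it is solved?"),
    ("Rollback", "(Optional) How can I undo my fix?"),
]

def new_runbook(title):
    """Return an empty Markdown runbook with the standard headings."""
    lines = [f"# Runbook: {title}", ""]
    for heading, prompt in SECTIONS:
        lines += [f"## {heading}", f"<!-- {prompt} -->", ""]
    return "\n".join(lines)

print(new_runbook("Event ingestion backlog"))  # hypothetical runbook title
```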

Another tool that we have found useful when debugging is a ‘change-calendar’ Slack channel: a low-friction way to put out a notification of any significant change to the system, regardless of the cause. This could be a deployment, a change in client behaviour, or downtime of an external system, for example. It has proven incredibly useful for identifying the cause of changes when looking at historical data.

Optional message template for change-calendar messages
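
Posting to the channel can be as simple as hitting a Slack incoming webhook. Below is a minimal sketch; the webhook URL comes from Slack's incoming-webhooks setup for the channel, and the fields are just a suggested template rather than a required schema.

```python
import os
import requests

# Slack incoming webhook for the change-calendar channel.
WEBHOOK_URL = os.environ["CHANGE_CALENDAR_WEBHOOK"]

def post_change(what, who, impact="None expected"):
    """Announce a significant system change in the change-calendar channel."""
    text = "\n".join([
        f"*Change:* {what}",
        f"*Who/what made it:* {who}",
        f"*Expected impact:* {impact}",
    ])
    response = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
    response.raise_for_status()

post_change("Deployed profiler v2.3 to production", "Ops", "Brief spike in CPU")
```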

Improving

We use Clubhouse to keep track of the status of issues and make sure they are resolved as fast as possible. After resolving an issue, we reduce the likelihood of the problem recurring by ensuring that every issue is eventually moved to a “System Improved” state. This normally takes the form of additional tests being written, metrics being sent or alerts created, a change in process, or a runbook being updated.
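
Clubhouse's REST API makes that backlog easy to keep an eye on. As a rough sketch, listing bugs that haven't yet reached “System Improved” might look like the following; note that the search-query syntax here is an assumption on our part, so check the Clubhouse API docs before relying on it.

```python
import os
import requests

SEARCH_URL = "https://api.clubhouse.io/api/v3/search/stories"

def stories_awaiting_improvement():
    """List stories that have been resolved but not yet moved to 'System Improved'."""
    response = requests.get(
        SEARCH_URL,
        params={
            "token": os.environ["CLUBHOUSE_API_TOKEN"],
            # Assumed query syntax: bugs not yet in the "System Improved" state.
            "query": 'type:bug !state:"System Improved"',
        },
        timeout=10,
    )
    response.raise_for_status()
    return [story["name"] for story in response.json()["data"]]
```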

Ops stack

We hope you have enjoyed this series on Ops at Carbon. If you’re interested in working with us, we’re hiring.
