Carbon DMP is a next-generation intent data management platform that enables publishers, agencies, and advertisers to leverage unique first-party data for more profitable outcomes.

The Carbon platform processes millions of events every day, and accrues tens of thousands of pounds in monthly infrastructure costs across Azure, Amazon Web Services (AWS), Databricks, and SpotInst. The system consists of around 100 different services, with almost 1,000 resources hosted in Azure.

The responsibilities of the Operations (Ops) team at Carbon can be split roughly into the following categories:

  1. System Monitoring
  2. Cost Monitoring
  3. Infrastructure
  4. Issue Handling

In Part 1 of our ‘DevOps at Carbon’ series, we covered System Monitoring, and introduced Graphite, Grafana, and Azure Application Insights; check that out here.

In Part 2, we shared how we monitor, manage, and reduce our infrastructure costs; check that out here.

In Part 3, we’re going into detail on our infrastructure, including some issues we faced when containerising applications, and how we overcame them.

Ops at Carbon are responsible for ensuring that the infrastructure that Carbon runs on is working as expected, and can be easily monitored, debugged, updated, and backed up.

Deployments

We use AppVeyor to run automated builds, tests, and deployments. AppVeyor also allows us to easily re-deploy the last known good build if a failure occurs. We receive Slack notifications on build completion or failure. We also use Slack to notify relevant team members of any Azure web app slot swaps, giving them a complete picture of the deployment process.

SSL

Once APIs are deployed, we need to make sure they’re available and secure. We use Let’s Encrypt for free SSL certificates, and automate renewal where we can. Expiring SSL certificates are notorious for causing issues, and we use a variety of techniques to ensure we are not caught out. Certcheckr.com, which has a free tier, is useful for alerting us when certificates are close to expiring.
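As a rough illustration of this kind of expiry check (a sketch only, not how Certcheckr.com or the renewal automation works), the Python below reads a certificate's notAfter field and counts the days remaining. The function names, the 14-day threshold, and the host in the usage note are all hypothetical.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp."""
    # ssl.cert_time_to_seconds parses strings like "Jan 15 00:00:00 2030 GMT".
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def report(host: str, warn_days: int = 14) -> str:
    """Build a one-line status; warn_days is a hypothetical threshold."""
    remaining = days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))
    status = "WARNING" if remaining < warn_days else "ok"
    return f"{status}: {host} certificate expires in {remaining} days"
```

Calling `print(report("example.com"))` on a schedule would flag a certificate before it lapses. Note that a certificate that has already expired fails the TLS handshake here, so that case surfaces as an `ssl.SSLError` rather than a negative day count.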

Azure Autoscale Events

We use Azure Autoscale to ensure our services are always able to handle volumes efficiently. Azure sends out notifications when these Autoscale events take place, and we created an Azure Function to listen via webhooks and send Slack messages, using emoji to get the point across in the simplest way possible. A warning, though: If you get a period with a lot of scale events, things start to look like Saturday Night Fever.

Scale events in Slack
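The webhook-to-Slack step could be sketched roughly as follows. This is not the actual Azure Function: it is plain Python showing only the payload-to-message translation. The field names (`operation`, `context.resourceName`, `context.oldCapacity`, `context.newCapacity`) follow the Azure autoscale webhook payload as commonly documented and should be treated as assumptions to verify; the emoji choices are illustrative.

```python
import json
import urllib.request

# Illustrative emoji mapping; the actual emoji used are not stated in the post.
EMOJI = {"Scale Out": ":arrow_up:", "Scale In": ":arrow_down:"}


def autoscale_to_slack(payload: dict) -> dict:
    """Turn an autoscale webhook payload into a Slack message body."""
    context = payload.get("context", {})
    operation = payload.get("operation", "Unknown")
    emoji = EMOJI.get(operation, ":question:")
    text = (
        f"{emoji} {context.get('resourceName', 'unknown resource')}: "
        f"{operation} {context.get('oldCapacity', '?')} -> "
        f"{context.get('newCapacity', '?')}"
    )
    return {"text": text}


def post_to_slack(webhook_url: str, message: dict) -> None:
    """POST the message as JSON to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

Keeping the formatting in a pure function separate from the HTTP call makes the message easy to test without touching Slack.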

Rancher

So far, we’ve mostly talked about Azure web apps. However, we increasingly use Docker to improve the scalability of our workloads and to take advantage of spot instance savings via SpotInst. We use Kubernetes to manage these Docker containers, and Rancher to interact with the Kubernetes cluster easily.

When we moved to Kubernetes and Rancher, we were missing some key monitoring abilities that we were used to with Azure, so we had to deploy some extras.

Originally, we deployed Prometheus to scrape metrics from each container; however, Rancher now comes with built-in monitoring using Prometheus and Grafana.

Rancher allows us to access the logs of a pod very easily via the UI. However, these logs are not centralised, which makes it difficult to debug an issue if an app has scaled out to several pods. The logs may also be lost when we lose a node, which can happen frequently when using spot instances.

Centralised logs in Loki

To resolve these issues, we use Loki (https://grafana.com/loki) to centralise the logs in our Kubernetes cluster. Loki keeps logs even if a node dies, and we can query logs using the same label selector syntax that we use for Prometheus. This means that if we see a spike in a certain Prometheus metric, we can instantly see the relevant logs, all from inside the Explore tab in Grafana.
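To make the shared selector idea concrete, here is a minimal sketch that builds a Prometheus-style label selector and uses it to query Loki over HTTP. The `/loki/api/v1/query_range` endpoint and its `query`, `start`, `end`, and `limit` parameters come from Loki's HTTP API and may differ in older Loki versions; the base URL and the label names in the usage note are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_selector(labels: dict) -> str:
    """Build a label selector such as {app="api",namespace="prod"}."""
    pairs = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + pairs + "}"


def query_range_url(base_url: str, labels: dict, start_ns: int, end_ns: int,
                    limit: int = 100) -> str:
    """Compose a query_range URL; start and end are Unix nanosecond timestamps."""
    params = urllib.parse.urlencode({
        "query": build_selector(labels),
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


def fetch_log_lines(url: str) -> list:
    """Run the query and flatten every returned stream into a list of log lines."""
    with urllib.request.urlopen(url, timeout=10) as response:
        body = json.load(response)
    # Each stream carries a list of [timestamp, line] pairs under "values".
    return [line for stream in body["data"]["result"] for _, line in stream["values"]]
```

For example, `query_range_url("http://loki:3100", {"app": "api"}, start_ns, end_ns)` asks for logs from every pod carrying the label `app="api"`, which is the same selector that would filter that app's series in a Prometheus query.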

The Kubernetes cluster has become central to our stack, and it’s essential that we back up all the information there so that we can recreate the cluster if anything were to happen. We deployed Pieter Lange’s ‘kube-backup’ (https://github.com/pieterlange/kube-backup) to back up the cluster configs to a git repository every 10 minutes.

In Part 4, we’ll wrap up the series with a look at Issue Handling, with details on how we keep on top of any bugs that crop up, and how we ensure we’re always able to respond to any customer queries as fast as possible.
