Carbon DMP is a data management platform that enables publishers, agencies, and advertisers to leverage unique 1st party data for more profitable outcomes.
The Carbon platform processes millions of events every day, and accrues tens of thousands of pounds in monthly infrastructure costs across Azure, Amazon Web Services (AWS), Databricks, and SpotInst. The system comprises around 100 different services, with almost 1000 resources hosted in Azure.
The responsibilities of the Operations (Ops) team at Carbon can be split roughly into the following categories:
- System Monitoring
- Cost Monitoring
- Issue Handling
In Part 1 of our ‘DevOps at Carbon’ series, we covered System Monitoring, and introduced Graphite, Grafana, and Azure Application Insights; check that out here.
In Part 2, we’re sharing details on how we monitor, manage, and reduce these costs.
When working in the cloud, it is easy to underestimate the cost of a new feature, or simply forget to turn something off. It is important to keep costs under control, so we built a system that pulls in daily costs from each of our main infrastructure vendors, created a Grafana dashboard of costs per platform per day, and added alerts that have already caught several issues.
Azure, AWS, and SpotInst all have APIs that can return cost-per-day figures. We created an Azure Function to fetch these costs from each API, convert them to GBP, and send them to Graphite. One thing to keep in mind when working with these figures is that it may be a few days before they finalise, so each day we overwrite the last week or so of data.
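As a minimal sketch of that function, the snippet below stubs out the vendor APIs with sample figures (the real implementation calls each vendor's billing API and a live FX-rate service), converts to GBP, and builds Graphite plaintext-protocol lines for the last week. All metric names, sample costs, and exchange rates here are hypothetical:

```python
from datetime import date, timedelta
import calendar

# Hypothetical daily cost figures, standing in for the Azure, AWS,
# and SpotInst billing APIs. Each returns (currency, amount) for a day.
def fetch_daily_costs(vendor, day):
    sample = {"azure": ("USD", 512.40), "aws": ("USD", 301.75),
              "spotinst": ("USD", 88.10)}
    return sample[vendor]

# Illustrative fixed rates; in practice these come from an FX-rate API.
FX_TO_GBP = {"USD": 0.79, "EUR": 0.86, "GBP": 1.0}

def graphite_lines(vendors, today, days_back=7):
    """Build Graphite plaintext-protocol lines for the last `days_back` days.

    Re-sending recent days overwrites the earlier values in Graphite,
    which handles vendors whose figures take a few days to finalise.
    """
    lines = []
    for offset in range(days_back):
        day = today - timedelta(days=offset)
        ts = calendar.timegm(day.timetuple())  # midnight UTC as a Unix timestamp
        for vendor in vendors:
            currency, amount = fetch_daily_costs(vendor, day)
            gbp = round(amount * FX_TO_GBP[currency], 2)
            lines.append(f"costs.{vendor}.daily_gbp {gbp} {ts}")
    return lines

lines = graphite_lines(["azure", "aws", "spotinst"], date(2020, 1, 15))
print(len(lines))   # 3 vendors x 7 days = 21 lines
print(lines[0])
```

In production each line is then written to the Graphite server's plaintext port (2003 by default) over a TCP socket.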
Originally we thought it best to have as much detail as possible in the Grafana cost charts. However, in the last year or so Azure Cost Analysis and the AWS Cost Explorer have improved greatly, to the point where it makes more sense to keep just the totals in Grafana, and go to the platform’s built in tooling to explore costs further.
Having these total cost figures in Grafana also allows us to combine them with other metrics throughout the system. For example, if we assume that cost should increase roughly in line with the volume of pagevisits processed, we can create a graph that calculates and tracks the Cost Per Million PageVisits, and alerts us if this rises above an expected value.
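As a sketch, assuming the daily totals live under series like `costs.*.daily_gbp` and pagevisit volume under `events.pagevisits.count` (both names hypothetical), a Graphite expression for this derived metric might look like:

```
scale(divideSeries(sumSeries(costs.*.daily_gbp), events.pagevisits.count), 1000000)
```

A Grafana alert rule on this query then fires whenever the ratio breaches the expected ceiling, regardless of which vendor's costs caused the rise.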
Tagging resources in Azure and AWS is key to identifying problem areas when using these tools. At Carbon, we add ‘feature’, ‘owner’, and ‘environment’ tags to every resource.
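For illustration, in an ARM template these tags are just a `tags` object on each resource; the resource name, SKU, and tag values below are hypothetical:

```json
{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2019-06-01",
  "name": "carbonexamplestorage",
  "location": "westeurope",
  "sku": { "name": "Standard_LRS" },
  "kind": "StorageV2",
  "tags": {
    "feature": "event-ingest",
    "owner": "ops",
    "environment": "production"
  }
}
```

With consistent tags in place, both Azure Cost Analysis and AWS Cost Explorer can group and filter spend by feature, owner, or environment.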
Storage costs in the cloud tend to increase continuously unless actively monitored. We highly recommend deleting unused data and taking advantage of lifecycle policies, such as those available for AWS S3 buckets and Azure general-purpose v2 storage accounts. Be aware that Azure's general-purpose v2 storage accounts may not be cost effective if the data stored is very frequently accessed; as always, use the cost calculators available.
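As an example of such a policy, an S3 lifecycle configuration can transition older objects to a cheaper storage class and expire them entirely after a retention period; the rule ID, prefix, and day counts below are hypothetical:

```json
{
  "Rules": [
    {
      "ID": "archive-and-expire-raw-events",
      "Filter": { "Prefix": "raw-events/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

Azure blob storage offers equivalent lifecycle management rules for moving blobs between hot, cool, and archive tiers.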
In Part 3 we’ll take a closer look at Infrastructure; introducing Rancher to manage Kubernetes clusters, Loki to centralise logging, and going into detail on how we keep on top of deployments, SSL certificates, and scaling.