Carbon DMP is a data management platform that enables publishers, agencies, and advertisers to leverage unique first-party data for more profitable outcomes.
The Carbon platform processes millions of events every day, and accrues tens of thousands of pounds in monthly infrastructure costs across Azure, Amazon Web Services (AWS), Databricks, and SpotInst. The system comprises around 100 different services, with almost 1,000 resources hosted in Azure.
The responsibilities of the Operations (Ops) team at Carbon can be split roughly into the following categories:
- System Monitoring
- Cost Monitoring
- Issue Handling
In this series, we will go into detail on each of these areas, with examples of how each area applies to the Carbon platform, the technology used, and guidance based on what has worked well for us.
Monitoring is essential to ensuring that the systems we have in place are functioning as expected. Every service we have at Carbon emits metrics that, when combined, enable us to monitor the system as a whole. Examples of these metrics include volume going through the system, sales revenue, and uptime.
We use Grafana as our main data visualisation and monitoring tool. Grafana gives us the ability to visualise metrics in a variety of different ways, create dashboards, and share them with the team.
We add alerts to Grafana graphs to notify us of any potential issues with the platform. These alerts normally trigger a Slack message to channels with the relevant people.
Annotations give context to Grafana graphs. We add these either via the UI or using the API, and they are also added automatically when an alert is triggered.
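As a rough illustration of the API route, the sketch below builds an annotation body and posts it to Grafana's HTTP annotations endpoint (`POST /api/annotations`). The base URL, API token, tags, and function names are hypothetical placeholders, not a description of our actual setup.

```python
import json
import time
import urllib.request


def build_annotation(text, tags, when=None):
    """Build the JSON body for a Grafana annotation.

    Grafana expects `time` as epoch milliseconds.
    """
    when = time.time() if when is None else when
    return {"time": int(when * 1000), "tags": list(tags), "text": text}


def post_annotation(base_url, api_token, annotation):
    """POST the annotation to Grafana (base_url and api_token are placeholders)."""
    request = urllib.request.Request(
        url=f"{base_url}/api/annotations",
        data=json.dumps(annotation).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Example only: print the payload instead of calling a real server.
    print(build_annotation("Deployed worker", ["deploy", "ops"], when=1_600_000_000))
```

Tagging annotations (for example with `deploy`) makes it possible to filter which ones appear on a given dashboard.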
Calculations in Grafana, such as 30-day moving averages, reduce noise and help us spot longer-term trends. Alerts on these graphs are particularly useful, because the graphs themselves may not be viewed as regularly.
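To show what a moving average does to a noisy series, here is a minimal sketch in plain Python. It is not how Grafana implements the calculation; the function name and the choice to average over a shorter window at the start of the series are assumptions made for the example.

```python
from collections import deque


def moving_average(values, window):
    """Return the trailing moving average of `values`.

    Each output point averages up to `window` of the most recent inputs,
    so the first few points use a shorter window rather than being dropped.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    running_total = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            running_total -= recent[0]  # oldest value is about to be evicted
        recent.append(value)
        running_total += value
        averages.append(running_total / len(recent))
    return averages


if __name__ == "__main__":
    daily_volume = [100, 102, 98, 400, 101, 99]  # one-day spike on day 4
    print(moving_average(daily_volume, window=3))
```

A longer window smooths a single-day spike more heavily, which is why a 30-day average is better suited to trend alerts than to catching short-lived incidents.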
Grafana supports multiple data sources. These data sources are mostly time series databases, and can usually be categorised as ‘push’, where metrics are sent to the data source, or ‘pull’, where the data source periodically gets the metrics from different locations.
The data source we use for the majority of our custom metrics is Graphite, a well-established ‘push’ time series database.
Originally, each of our apps would push data directly to Graphite. This worked okay, but wasn’t scalable, as each app would have its own push logic.
The first step to reduce this duplicate logic, and abstract the database implementation from the apps, was to create an Azure Event Hub. Apps send metrics to this centralised location, and a worker takes metrics off the hub, standardising the way the metrics are sent to Graphite. We also created a NuGet package to simplify how we send metrics to the Event Hub.
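The standardising step the worker performs can be sketched as follows. Graphite's plaintext protocol accepts one metric per line in the form `<path> <value> <timestamp>`, by default on TCP port 2003. The `apps.` prefix, the name-cleaning rules, and the function names are assumptions for illustration, and the example is in Python for brevity rather than the .NET stack implied by the NuGet package.

```python
import socket
import time


def clean(part):
    """Make a name safe for use as a single Graphite path segment."""
    # Graphite splits lines on whitespace and paths on dots.
    return part.strip().replace(" ", "_").replace(".", "_")


def to_graphite_line(app, metric, value, timestamp=None):
    """Format one metric as a Graphite plaintext line: 'path value timestamp\\n'."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    path = f"apps.{clean(app)}.{clean(metric)}"  # hypothetical naming scheme
    return f"{path} {value} {timestamp}\n"


def send_lines(lines, host, port=2003):
    """Send already-formatted lines to Graphite over TCP (host is a placeholder)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall("".join(lines).encode("utf-8"))


if __name__ == "__main__":
    print(to_graphite_line("ingest api", "events.count", 42, 1_600_000_000), end="")
```

Keeping the formatting in one place means a change to the naming scheme, or a move to a different database, only touches the worker and not every app.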
When using Graphite, it is important to carefully plan the storage schema and aggregation to ensure that the right level of detail is captured for each time period. Thinking about this up front means graphs using this data are far quicker to load, and alerts are less likely to time out.
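One way to reason about a schema before committing to it is to work out how many datapoints each archive will hold. The sketch below parses a Graphite-style retention string (`precision:duration` pairs, as used in `storage-schemas.conf`) and reports the point count per archive. The retention string in the example is hypothetical, not the one we use.

```python
# Seconds per unit; Graphite treats a year as 365 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert a value such as '10s' or '7d' to seconds."""
    number, unit = spec[:-1], spec[-1]
    return int(number) * UNIT_SECONDS[unit]


def describe_retentions(retentions):
    """Return (seconds_per_point, seconds_kept, points) for each archive."""
    archives = []
    for archive in retentions.split(","):
        precision, duration = archive.strip().split(":")
        step, kept = to_seconds(precision), to_seconds(duration)
        archives.append((step, kept, kept // step))
    return archives


if __name__ == "__main__":
    # Hypothetical schema: 10s detail for 6 hours, 1m for 7 days, 10m for 5 years.
    for step, kept, points in describe_retentions("10s:6h,1m:7d,10m:5y"):
        print(f"{step:>4}s per point, kept {kept:>9}s, {points} points")
```

A graph covering the last year would read from the coarsest archive here, so it loads a few tens of thousands of points instead of millions at 10-second resolution.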
Azure App Insights
Grafana allows multiple data sources to be combined and used alongside each other in the same graphs and dashboards. In addition to custom metrics pushed to Graphite, we also use Azure Monitor and AWS CloudWatch as data sources.
Microsoft has invested heavily in Azure Application Insights, a suite of tools for monitoring Azure services, particularly web apps. At Carbon, we use Azure App Insights heavily.
To highlight a few key features, Azure App Insights includes:
- Live Metrics Stream: A feed of requests to web apps, and continuously updating graphs, which is useful for monitoring changes and debugging.
- Failures: Allows you to drill down into exceptions and uses AI to group together failures and identify common traits.
- Diagnose and Solve Problems: Incredibly useful debugging feature and source of advice.
- Availability Tests: Monitors the uptime of different services, which we use alongside UptimeRobot (which has a good free tier) to ensure everything is available 24/7, and from different locations.
The most useful feature for us, though, is Analytics, which allows us to query requests using the Kusto query language. Analytics also allows Custom Events to be passed up and queried, and can even access external data. Queries can be saved and shared to allow teams to use the same code on common use cases, and queries are also available in Grafana.
We created an Azure Function which uses the Application Insights API to check the volume of custom events from different clients over the last hour against the same hour the previous week, and alerts via a Slack message if the current volume has changed significantly. We use similar logic over a 24-hour period to check for daily spikes or drops.
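The comparison at the heart of that check might look something like the sketch below. It covers only the decision logic, not the Application Insights query or the Slack call. The 50% threshold, the function name, and the message wording are assumptions for illustration.

```python
def volume_alert(client, current, previous_week, threshold=0.5):
    """Return an alert message if volume moved by more than `threshold`, else None.

    `current` is the event count for the last hour and `previous_week` is the
    count for the same hour one week earlier. `threshold` is a fraction
    (0.5 means a 50% change); the default is an arbitrary example value.
    """
    if previous_week == 0:
        # No baseline to compare against: only flag traffic appearing from nothing.
        if current == 0:
            return None
        return f"{client}: {current} events this hour, none in the same hour last week"
    change = (current - previous_week) / previous_week
    if abs(change) <= threshold:
        return None
    direction = "up" if change > 0 else "down"
    return (
        f"{client}: volume {direction} {abs(change):.0%} vs the same hour "
        f"last week ({previous_week} -> {current})"
    )


if __name__ == "__main__":
    print(volume_alert("client-a", 400, 1000))
    print(volume_alert("client-b", 1000, 1100))
```

Comparing against the same hour a week earlier, rather than the previous hour, avoids alerting on normal daily and weekly traffic patterns.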
Check out Part 2, where we take a closer look at Cost Monitoring: tracking costs across platforms, flagging any potential issues, and digging into the cause of any problems.