CloudWatch does not aggregate across dimensions for your custom metrics

red888 · Jan 25, 2018

Reading the docs, I saw this statement:

CloudWatch does not aggregate across dimensions for your custom metrics

That seems like a HUGE limitation, right? It would make custom metrics all but useless in my estimation, so I want to confirm I'm understanding this correctly.

For example, say I have a custom metric that I ship from multiple servers. I want to see it per server, but I also want to see all servers together. Would I have no way of aggregating it across all the servers? Or would I be forced to create two custom metrics, one per server and one for all servers, and double-post from each server to both the per-server metric AND the aggregate one?

Answer

Dejan Peretin · Jan 26, 2018

The docs are correct: CloudWatch won't aggregate across dimensions for your custom metrics (it does so for some metrics published by other services, like EC2).

This feature may seem useful and clear for your use case, but it's not clear how such aggregation would behave in the general case. CloudWatch allows up to 10 dimensions, so aggregating across all combinations of them could produce a lot of useless metrics, all of which you would be billed for. People may use dimensions to split their metrics between Test and Prod stacks, for example; those are completely separate, and aggregating them would not make sense.

CloudWatch treats a metric name plus a full set of dimensions as a unique metric identifier. In your case, this means that you need to publish each observation separately to every metric you want it to contribute to.

Let's say you have a metric named Latency, and you're putting a hostname in a dimension called Server. If you have three servers, this will create three metrics:

  • Latency, Server=server1
  • Latency, Server=server2
  • Latency, Server=server3

So the approach you mentioned in your question will work. If you also want a metric showing the data across all servers, each server needs to publish every observation to one additional metric. The best way to do that is to use a common value for the Server dimension, something like AllServers. You will end up with 4 metrics, like this:

  • Latency, Server=server1 <- only server1 data
  • Latency, Server=server2 <- only server2 data
  • Latency, Server=server3 <- only server3 data
  • Latency, Server=AllServers <- data from all 3 servers
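As a rough illustration of the double-publishing approach, here is a minimal Python sketch. It assumes boto3 is installed and credentials are configured; the namespace `SomeNamespace` and the helper names `build_metric_data` and `publish` are made up for this example. Each observation is sent twice in one `put_metric_data` call: once with the server's own hostname and once with the shared `AllServers` value.

```python
def build_metric_data(hostname, latency_ms):
    """Return two datapoints for one observation:
    one under Server=<hostname>, one under Server=AllServers."""
    return [
        {
            "MetricName": "Latency",
            "Dimensions": [{"Name": "Server", "Value": server}],
            "Value": latency_ms,
            "Unit": "Milliseconds",
        }
        for server in (hostname, "AllServers")
    ]


def publish(hostname, latency_ms, namespace="SomeNamespace"):
    """Send both datapoints to CloudWatch (namespace is an assumption)."""
    import boto3  # imported lazily so build_metric_data works without AWS access

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=build_metric_data(hostname, latency_ms),
    )
```

Calling `publish("server1", 120.0)` from server1 would feed both the `Server=server1` metric and the `Server=AllServers` metric.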

Update 2019-12-17

Using metric math SEARCH function: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html

This will give you per-server latency and latency across all servers without publishing a separate AllServers metric. If a new server shows up, it will be picked up by the expression automatically.

Graph source:

{
    "metrics": [
        [ { "expression": "SEARCH('{SomeNamespace,Server} MetricName=\"Latency\"', 'Average', 60)", "id": "e1", "region": "eu-west-1" } ],
        [ { "expression": "AVG(e1)", "id": "e2", "region": "eu-west-1", "label": "All servers", "yAxis": "right" } ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "region": "eu-west-1"
}

The result is a graph with one line per server, plus the "All servers" average plotted against the right axis.

Downsides of this approach:

  • Expressions are limited to 100 metrics.
  • Overall aggregation is limited to available metric math functions, which means percentiles are not available as of 2019-12-17.
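If you need the same graph for several metrics or namespaces, the graph source shown earlier can be generated instead of edited by hand. The following is only a sketch: `build_graph_source` is a hypothetical helper, and the defaults simply mirror the values used in the example above.

```python
import json


def build_graph_source(namespace, dimension, metric_name, region,
                       stat="Average", period=60):
    """Build a graph source with a SEARCH expression (e1) and its average (e2)."""
    search = (
        f"SEARCH('{{{namespace},{dimension}}} "
        f"MetricName=\"{metric_name}\"', '{stat}', {period})"
    )
    return {
        "metrics": [
            # e1: one time series per value of the dimension
            [{"expression": search, "id": "e1", "region": region}],
            # e2: average of all series returned by e1, on the right axis
            [{"expression": "AVG(e1)", "id": "e2", "region": region,
              "label": "All servers", "yAxis": "right"}],
        ],
        "view": "timeSeries",
        "stacked": False,
        "region": region,
    }


if __name__ == "__main__":
    source = build_graph_source("SomeNamespace", "Server", "Latency", "eu-west-1")
    print(json.dumps(source, indent=4))
```

With the arguments shown, the printed JSON should match the graph source above.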

Using Contributor Insights (open preview as of 2019-12-17): https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html

If you publish your logs to CloudWatch Logs in JSON or Common Log Format (CLF), you can create rules that keep track of top contributors. For example, a rule that keeps track of servers with latencies over 400 ms would look something like this:

{
    "Schema": {
        "Name": "CloudWatchLogRule",
        "Version": 1
    },
    "AggregateOn": "Count",
    "Contribution": {
        "Filters": [
            {
                "Match": "$.Latency",
                "GreaterThan": 400
            }
        ],
        "Keys": [
            "$.Server"
        ],
        "ValueOf": "$.Latency"
    },
    "LogFormat": "JSON",
    "LogGroupNames": [
        "/aws/lambda/emf-test"
    ]
}

The result is a list of the servers with the most datapoints over 400 ms.
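To make it clearer what the rule above computes, here is a small local emulation in Python. This is not how Contributor Insights is implemented and it does not call AWS; `top_contributors` is a hypothetical function that applies the same idea (filter on Latency > 400, group by Server, count) to JSON log lines held in memory.

```python
import json
from collections import Counter


def top_contributors(log_lines, threshold=400, key="Server", field="Latency"):
    """Count, per server, the log events whose latency exceeds the threshold."""
    counts = Counter()
    for line in log_lines:
        event = json.loads(line)
        value = event.get(field)
        # Mirror the rule's filter: only numeric values above the threshold count
        if key in event and isinstance(value, (int, float)) and value > threshold:
            counts[event[key]] += 1
    # Highest count first, like a "top contributors" list
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        '{"Server": "server1", "Latency": 450}',
        '{"Server": "server2", "Latency": 120}',
        '{"Server": "server1", "Latency": 900}',
        '{"Server": "server3", "Latency": 401}',
    ]
    print(top_contributors(sample))
```

For the sample lines, server1 has two datapoints over 400 ms, server3 has one, and server2 does not appear at all.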

Bringing it all together with CloudWatch Embedded Metric Format: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html

If you publish your data in CloudWatch Embedded Metric Format, you can:

  • Easily configure dimensions, so you can have per-server metrics and an overall metric if you want.
  • Use CloudWatch Logs Insights to query and visualise your logs.
  • Use Contributor Insights to get top contributors.
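As a sketch of the first point, the Python snippet below builds one embedded metric format log line that declares two dimension sets for the same Latency value. The `Service` dimension, its value `my-service`, the namespace `SomeNamespace`, and the function name `emf_log_line` are all assumptions made for this example. The `["Service", "Server"]` set yields a per-server metric, and the `["Service"]` set yields a metric that combines all servers of the service.

```python
import json
import time


def emf_log_line(server, latency_ms, service="my-service",
                 namespace="SomeNamespace"):
    """Return one JSON log line in embedded metric format."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # Two dimension sets: per server, and per service overall
                    "Dimensions": [["Service", "Server"], ["Service"]],
                    "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
                }
            ],
        },
        # Dimension and metric values live at the root of the document
        "Service": service,
        "Server": server,
        "Latency": latency_ms,
    }
    return json.dumps(record)


if __name__ == "__main__":
    print(emf_log_line("server1", 123.0))
```

Because `Server` and `Latency` remain ordinary JSON fields in the log line, a Contributor Insights rule keyed on `$.Server` and `$.Latency` can read the same events.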