Composite Metrics Tutorial

Composite metrics let you define a new higher-level metric by specifying an arbitrary set of mathematical transformations to perform on a selection of native metrics or time-series you’re sending to Librato. Alternatively you can think of composite metrics as a very powerful way to specify arbitrarily complex queries against your native metrics.

Composite metrics can be saved in charts, and correlated with native metrics in the same chart. You can assign properties like y-axis titles and tool-tip aliases to them just like you do to native metrics.

If you need help building composites, don’t hesitate to contact support or post a question on StackOverflow using the #librato tag.

There are two ways to create composites: in a chart and as a persisted composite.

Composites in a Chart

To create a composite in a chart, create a new chart or edit an existing one, then click on the Composite Metric Editor button to the right of the metric search field. This will open the composite editor.

[Screenshot: the Composite Metric Editor in a chart]

Persisted Composites

Composites can be saved for use on multiple charts or as an alert condition. Go to your metrics tab and click on the Create Composite Metric button in the upper right corner.

[Screenshot: creating a persisted composite]

From there you can enter the composite definition and save it so that it can be used globally, like any other metric.

Composing Metric Queries

The composite metrics editor provides a text field into which you can type a query that will create a composite metric in the graph. Queries are composed of time-series wrapped in functions in the general form:

function(series(metric,tags))

Note

You cannot nest composite metrics.

Let’s take a quick tour of the DSL used to create and work with composite metrics.

Sets and the Series() Function

Composite metrics are created from sets and functions. A time series is a stream of time-aligned measurements of the sort that is identified by a combination of metric name and tag(s) in our metrics platform. A set is a list of time series. For example, if you have a metric called “librato.load.load.shortterm”, and one tag called “host” with three values representing three hosts (server1, server2 and server3) which emit that metric, then the combination of the metric name and tag with a wildcard {"host":"*"} specifies a set of three time series: librato.load.load.shortterm across each of the three different servers.

The series() function is used to retrieve sets of measurements from your existing metrics data. It takes three arguments: the first (the metric name) and the second (the tag set, made up of name/value combinations) are required, and the third is optional (we’ll get to that in a moment). Here are some examples using the series() function:

series("librato.load.load.shortterm", {})

… will retrieve all measurements for the librato.load.load.shortterm metric. Empty curly brackets are interpreted as “all tags”.

series("librato.load.load.shortterm", {"host":"*"})

… will retrieve the set of measurements from each value of the host tag (server1, server2 and server3 in our example).

series("librato.load.load.shortterm", {"host":"server1"})

… will retrieve only the set of measurements from server1.

You can use wildcards in the metric name too. So a series() invocation like this…

series("librato.load.*", {"host":"*"})

… will retrieve a set containing the 1, 5, and 15 minute CPU load metrics (shortterm, midterm, and longterm) from all hosts.
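To make the selection rules concrete, here is a minimal Python sketch of how a metric-name pattern and a tag set could pick out a set of time series. It is a model for illustration only, not Librato’s implementation: the STREAMS list, the select() helper, and the use of shell-style matching via fnmatch are all assumptions.

```python
from fnmatch import fnmatchcase

# Hypothetical stored streams as (metric name, tags) pairs. The names and hosts
# mirror the examples above; the storage layout itself is an assumption.
STREAMS = [
    ("librato.load.load.shortterm", {"host": "server1"}),
    ("librato.load.load.shortterm", {"host": "server2"}),
    ("librato.load.load.longterm", {"host": "server1"}),
    ("librato.cpu.percent.user", {"host": "server1"}),
]


def select(metric_pattern, tag_patterns):
    """Return the streams whose name and tags match the given patterns."""
    matches = []
    for name, tags in STREAMS:
        if not fnmatchcase(name, metric_pattern):
            continue
        # An empty tag set places no constraint on tags ("all tags").
        if all(key in tags and fnmatchcase(tags[key], pattern)
               for key, pattern in tag_patterns.items()):
            matches.append((name, tags))
    return matches


shortterm_all = select("librato.load.load.shortterm", {})
load_on_server1 = select("librato.load.*", {"host": "server1"})
```

In this model, shortterm_all holds one series per host that reports the metric, while load_on_server1 holds one series per load metric reported by server1.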

Varying the Summary Function

By default, the summary statistic used for each time series is the average. You can change this by setting the function option in the series() call to one of min, max, count, or sum:

series("cpu.load.*", {"host":"server*"}, {function:"sum"})

Varying the Resolution

You can also alter the granularity of the summarized data. By default, a query for the last 14 days will return time series with hourly datapoints, i.e. period:"3600". With the period option you can specify an arbitrary resolution, in seconds, to which the datapoints should be rolled up. For example, to specify daily resolution using sum as the summary function:

series("cpu.load.*", {"host":"server*"}, {function:"sum",period:"86400"})
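As a rough illustration of what the function and period options do together, the following Python sketch rolls raw datapoints up into fixed-size buckets and applies a summary function to each bucket. The roll_up() helper, the "average" label for the default, and the bucket alignment are assumptions made for this example, not a description of the service’s internals.

```python
from collections import defaultdict


def roll_up(points, period, function="average"):
    """Summarize (timestamp, value) points into buckets of `period` seconds."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of the bucket that contains it.
        buckets[timestamp - timestamp % period].append(value)
    summarize = {
        "average": lambda values: sum(values) / len(values),
        "min": min,
        "max": max,
        "count": len,
        "sum": sum,
    }[function]
    return [(start, summarize(values)) for start, values in sorted(buckets.items())]


# Four datapoints spread over two days (timestamps in seconds).
points = [(0, 1.0), (3600, 3.0), (86400, 2.0), (90000, 6.0)]
daily_sum = roll_up(points, 86400, "sum")
daily_average = roll_up(points, 86400)
```

With a period of 86400 the four points collapse into two daily datapoints, and the choice of summary function decides which value each day reports.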

Shorthand Notation

As every composite metric definition contains at least one invocation of series(), and more sophisticated definitions may contain several, our grammar supports s() as an equivalent shorthand notation, e.g.:

s("cpu.load.*", {"host":"server*"}, {function:"sum",period:"86400"})

Sets of sets

Finally, we can create sets of sets by square-bracketing series() calls. We can use this to specify multiple series as a single argument to functions like divide(), as we’ll see shortly, or to make more specific typeglobs. For example, to specify the shortterm and longterm load averages, but not the midterm average, we could use:

[s("librato.load.load.shortterm",{"host":"*"}),
s("librato.load.load.longterm",{"host":"*"})]

Dynamic Tags

When you add a metric to a chart, our user interface supports the concept of dynamic tags, which is a fancy way of saying that you’ll specify the value of the tag(s) later on when you actually want to view the data. You can use dynamic tags in your composite metric definitions by using % in lieu of the tag(s) argument to series(). We could make our set of load averages dynamic like so:

[s("librato.load.load.shortterm","%"),
s("librato.load.load.longterm","%")]

Aggregation and Transformation Functions

Composite metrics are created by applying transformation and aggregation functions to the native time-series data returned from series(). Each function takes a set as input and returns a set as output, so functions can be nested inside each other ad infinitum. Below are examples of commonly used functions.

sum()

The sum() function aggregates the input set down to a single series by adding together the measurements at each time interval. You get a single time series consisting of all the input series added together.

If, for example, you had several metrics that tracked the occurrences of HTTP response codes in your logs, you could get a total count of HTTP 4xx-series errors across all hosts using the sum function like so:

sum(s("prod.log.http.4*",{"host":"*"}))

You could use a set of sets to capture 4xx and 5xx errors like so:

sum([
    s("prod.log.http.4*",{"host":"*"}),
    s("prod.log.http.5*",{"host":"*"})])
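The point-by-point behaviour of sum() can be modelled in a few lines of Python. This is only a sketch under simplifying assumptions: each series is a plain list, all series are already time-aligned and the same length, and the host values are made up.

```python
def sum_series(series_set):
    """Add the series together interval by interval, yielding one series."""
    return [sum(values) for values in zip(*series_set)]


# Hypothetical 4xx counts per interval from three hosts.
host_a = [2, 0, 5]
host_b = [1, 1, 0]
host_c = [0, 3, 4]
total_4xx = sum_series([host_a, host_b, host_c])
```

Three input series go in and a single series comes out, with one total per time interval.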

subtract()

The subtract() function takes a set of exactly two time series and returns the result of subtracting the second from the first.

Continuing from the last example, if you tracked the occurrences of HTTP response codes in your logs, you could get a count of all non-2xx responses by combining sum() and subtract() in a set of sets like so:

subtract([
    sum(s("prod.log.http.*",{"host":"*"})),
    sum(s("prod.log.http.2*",{"host":"*"}))])
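In Python terms, subtract() behaves roughly like the sketch below. The subtract_series() helper and the sample counts are hypothetical, and the series are assumed to be time-aligned lists of equal length.

```python
def subtract_series(pair):
    """Subtract the second series from the first, interval by interval."""
    if len(pair) != 2:
        raise ValueError("subtract() expects exactly two series")
    first, second = pair
    return [a - b for a, b in zip(first, second)]


all_responses = [120, 95, 130]  # hypothetical result of sum() over prod.log.http.*
ok_responses = [110, 95, 101]   # hypothetical result of sum() over prod.log.http.2*
non_2xx = subtract_series([all_responses, ok_responses])
```

The order matters: swapping the two inputs would flip the sign of every datapoint.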

max()

The max() function aggregates the input set down to a single series by discarding all but the largest measurement at each time interval. You get a single time series in which each datapoint is the largest measurement found across the input series for that interval.

If you wanted to monitor load average site-wide with a single line, you could use the max() function to display the site-wide maximum load average.

max(s("librato.load.load.*",{"host":"*"}))

min()

The min() function aggregates the input set down to a single series by discarding all but the smallest measurement at each time interval. You get a single time series in which each datapoint is the smallest measurement found across the input series for that interval.

You could add a lower bound to the site-wide load average graph with a call to min() like so…

min(s("librato.load.load.*",{"host":"*"}))

mean()

The mean() function aggregates the input set down to a single series by computing the mean of all measurements at each time interval. You get a single time series consisting of the mean of the input series.

Adding the mean to our site-wide load average graph with mean() like so…

mean(s("librato.load.load.*",{"host":"*"}))

… yields a graph that represents the average CPU load across our entire infrastructure.
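The three aggregations used for the site-wide load graph differ only in the reducer applied at each time interval. The Python sketch below models that idea; the aggregate() helper and the load values are assumptions for illustration, and the series are taken to be time-aligned lists.

```python
from statistics import mean


def aggregate(series_set, reducer):
    """Collapse a set of series into one by reducing each time interval."""
    return [reducer(values) for values in zip(*series_set)]


# Hypothetical load averages from three hosts over two intervals.
loads = [
    [1.0, 4.0],
    [2.0, 2.0],
    [3.0, 0.0],
]
upper_bound = aggregate(loads, max)
lower_bound = aggregate(loads, min)
site_average = aggregate(loads, mean)
```

Plotted together, the three results give an envelope (upper and lower bound) around the site-wide average.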

derive()

The derive() function is useful when you want to view the rate of change of a given metric, that is, how much it increases or decreases from one measurement to the next. It transforms the input set by computing the derivative of each series in the set. Unlike the functions we’ve discussed thus far, derive() doesn’t aggregate or combine the input set; you get as many series out as you put in. In order to avoid certain edge cases involving late-arriving data, we recommend that you push calls to derive() to the lowest, or innermost, possible level, e.g.:

sum(derive(series()))

is preferable to:

derive(sum(series()))

You can track the rate of customer signups by using derive() to graph the rate of change in a total user count metric.

derive(s("prod.myapp.users.count", {"host":"*"}))
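A simple way to picture derive() is as the difference between consecutive measurements of one series. The Python sketch below assumes exactly that; how the service treats the first datapoint or uneven spacing is not covered here, and the user counts are invented.

```python
def derive(series):
    """Return the change between each pair of consecutive measurements."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical total user count sampled at four consecutive intervals.
user_count = [1000, 1004, 1010, 1009]
signup_rate = derive(user_count)
```

A positive value means the count grew during the interval and a negative value means it shrank. In this model the output has one datapoint fewer than the input.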

divide()

It’s often useful to calculate the ratio of two metrics. The divide() function takes a set of exactly two time series and returns the result of dividing the first by the second (the first series is used as the dividend and the second as the divisor). Continuing the last example, if you were tracking sign-ups and churn on a per-server basis, you could track the ratio of new customers to cancellations by first aggregating all of the per-server stats with sum(), and then dividing the result like so:

divide([
    sum(s("prod.myapp.signups", {"host":"*"})),
    sum(s("prod.myapp.cancels", {"host":"*"}))
])
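The sketch below models divide() in Python on two already-summed series. The divide_series() helper and the sample numbers are hypothetical, and returning None when the divisor is zero is an assumption of this example rather than documented behaviour.

```python
def divide_series(pair):
    """Divide the first series by the second, interval by interval."""
    if len(pair) != 2:
        raise ValueError("divide() expects exactly two series")
    dividend, divisor = pair
    # Assumption: an interval with a zero divisor produces no value (None).
    return [a / b if b != 0 else None for a, b in zip(dividend, divisor)]


signups = [30, 12, 8]  # hypothetical sum() over prod.myapp.signups
cancels = [10, 4, 0]   # hypothetical sum() over prod.myapp.cancels
signup_to_cancel_ratio = divide_series([signups, cancels])
```

A ratio above 1 means more sign-ups than cancellations in that interval.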

group_by()

Sometimes you may want to aggregate the results of a particular composite by a set of one or more shared tag names, which you can do with the group_by() function. Its first argument is a comma-separated list of one or more tag names to group by, and its second argument is the composite function to apply to each group. It is as if the enclosed composite function were executed once per unique set of tag values for the grouped tag names. For example, if you wanted insight into user signups based on location, you could group them by region like so:

group_by("region", sum(series("prod.myapp.signups", {"region" : "*east"})))

which is equivalent to the following composite function:

[sum(series("prod.myapp.signups", {"region" : "north-east"})),
 sum(series("prod.myapp.signups", {"region" : "south-east"}))]
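The "executed once per unique set of tag values" behaviour can be modelled in Python as grouping followed by aggregation. Everything in this sketch is illustrative: the group_by() and sum_series() helpers, the host tag, and the signup numbers are assumptions, and only a single grouping tag is handled.

```python
from collections import defaultdict


def sum_series(series_set):
    """Add the series together interval by interval."""
    return [sum(values) for values in zip(*series_set)]


def group_by(tag_name, aggregate, tagged_series):
    """Apply `aggregate` once per distinct value of `tag_name`."""
    groups = defaultdict(list)
    for tags, values in tagged_series:
        groups[tags[tag_name]].append(values)
    return {tag_value: aggregate(members) for tag_value, members in groups.items()}


# Hypothetical signup counts, tagged by region and host.
signups = [
    ({"region": "north-east", "host": "a"}, [3, 1]),
    ({"region": "north-east", "host": "b"}, [2, 2]),
    ({"region": "south-east", "host": "c"}, [5, 0]),
]
signups_by_region = group_by("region", sum_series, signups)
```

The result contains one summed series per region, which matches the two explicit sum() calls in the equivalent form.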

A real-world use-case

The Librato Agent reports CPU usage broken down per type, e.g.:

librato.cpu.percent.user
librato.cpu.percent.system

Assume we want to graph the total system CPU usage on a server server-instance-1. We’d start constructing our composite metric by retrieving the native metrics that detail system CPU usage for each core:

series("librato.cpu.percent.system", {"host":"server-instance-1"})

Now we can see how the system CPU usage changes over time, but it’s still in abstract units of time. It’s more useful and intuitive to examine system CPU usage in relation to the total capacity available on the server. We can do this by taking the ratio of the system CPU usage to the total CPU usage (which includes idle). Note that we use the [series1, series2] set notation to specify the inputs to divide():

divide(
[
    series("librato.cpu.percent.system", {"host":"server-instance-1"}),
    sum(series("librato.cpu.percent.*", {"host":"server-instance-1"}))
])

Finally, we can make our composite applicable to any server by switching to a dynamic tag, and slightly briefer by using the s() shorthand for series():

divide(
[
    s("librato.cpu.percent.system", "%"),
    sum(s("librato.cpu.percent.*", "%"))
])
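To check the arithmetic behind this composite, here is a small Python model of the same calculation for one server. The per-type values (including the idle series), the helper names, and the list representation are assumptions chosen for the example.

```python
def sum_series(series_set):
    """Add the series together interval by interval."""
    return [sum(values) for values in zip(*series_set)]


def divide_series(pair):
    """Divide the first series by the second, interval by interval."""
    dividend, divisor = pair
    return [a / b for a, b in zip(dividend, divisor)]


# Hypothetical per-type CPU readings for one server over two intervals.
cpu = {
    "system": [10.0, 20.0],
    "user": [30.0, 20.0],
    "idle": [60.0, 60.0],
}
total = sum_series(list(cpu.values()))                  # models sum(s("librato.cpu.percent.*", ...))
system_share = divide_series([cpu["system"], total])    # models divide([...])
```

Each datapoint of system_share is the fraction of total CPU capacity spent in system mode during that interval.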

Language Specification

Learn more by taking a quick tour of the DSL used to create and work with composite metrics.

If you have questions about composites, contact support or post a question on StackOverflow using the #librato tag.