Querying traces with SQL: examples

One of the unique characteristics of Promscale is that it allows you to use SQL to query your observability data. Concretely, Promscale stores OpenTelemetry traces in a relational model: you can query your traces as you will do with any other relational table, using all the TimescaleDB functionality.

This allowa you to use traces in a more sophisticated way than if you were using only Jaeger or Grafana Tempo. In this post, you have some examples of questions you could ask by directly querying the ps_trace database schema in Promscale :fire:

The data we use in these examples comes from a Kubernetes cluster running Honeycomb’s fork of Google’s microservices demo, that uses OpenTelemetry instrumentation.

Query 1: “List the top operations with the highest error rate in the last hour”

Response time, throughput, and error rate are the key metrics used to assess the health of a service. In particular, high error rate and response time are key indicators of the experience we are delivering to our users and need to be tracked closely. In this example, we look at how we can use OpenTelemetry traces and SQL to identify the main sources of errors in our services.

In Promscale, every span has an attribute that indicates the name of the service (service_name) and an attribute that indicates the name of the operation (name). (Technically, the service name in OpenTelemetry is not a span attribute but a resource attribute. For the purposes of querying traces in Promscale, we can think of service name as a span attribute as well.)

Every span has another attribute that indicates whether the span resulted in an error or not: status_code. If there is an error, the value of status_code is STATUS_CODE_ERROR.

Our goal is to write a query that will return one row per individual operation with the total number of executions (number of spans), the total number of executions that led to an error (spans where the status code is an error), and the percentage of spans with an error for the top 10 operations with more errors. Something like this:

    span_name as operation,
    COUNT(*) FILTER (WHERE status_code = 'STATUS_CODE_ERROR') as spans_with_error,
    COUNT(*) as total_spans,
    TO_CHAR(100*(CAST(COUNT(*) FILTER (WHERE status_code = 'STATUS_CODE_ERROR') AS float) / count(*)), '999D999%') as error_rate
FROM span
     start_time > NOW() - INTERVAL '1 hour'
GROUP BY service_name, operation
ORDER BY error_rate DESC

As you can see, the query uses the standard SQL syntax we are all familiar with: SELECT, FROM, WHERE, GROUP BY, ORDER BY

In the SELECT clause we project the service name and the operation (which corresponds to the span name). These are also the two attributes we use to aggregate the results in the GROUP BY clause.

The other three columns we project are the number of spans with an error, the total number of spans and the error rate within each error group. To do it we use some nifty SQL capabilities available in PostgreSQL:

  • COUNT (*) which returns the total number of rows, spans in this case, in a group.
  • COUNT (*) FILTER which returns the total number of spans in a group matching a certain criteria. In our case we want spans with an error which are indicated by the value STATUS_CODE_ERROR in the status_code attribute.
  • CAST to convert the number of spans with error to a floating point number so when it’s divided by the total count of spans it returns a floating number and not an integer. If we don’t do this then the number will be converted to the closest integer which will be 0 since that division is always less than 0.
  • TO_CHAR to convert the error rate to an easy-to-read percentage number.

Below is an example of results from this query:

service_name |             operation             | spans_with_error | total_spans | error_rate
 frontend     | /cart/checkout                    |               12 |         345 |    3.478%
 frontend     | hipstershop.AdService/GetAds      |               60 |        5214 |    1.115%
 adservice    | hipstershop.AdService/GetAds      |                1 |        5214 |     .019%
 cart         | grpc.health.v1.Health/Check       |                0 |         707 |     .000%
 cart         | hipstershop.CartService/EmptyCart |                0 |         345 |     .000%
 cart         | hipstershop.CartService/GetCart   |                0 |        7533 |     .000%
 checkout     | SQL SELECT                        |                0 |         361 |     .000%
 checkout     | grpc.health.v1.Health/Check       |                0 |         718 |     .000%
 checkout     | getDiscounts                      |                0 |         345 |     .000%
 checkout     | hipstershop.CartService/EmptyCart |                0 |         345 |     .000%

This immediately indicates that we need to take a closer look at the code behind the /cart/checkout operation, since it has a very high error rate - most likely leading to many lost sales!

Another thing that drives our attention in these results is the second and third rows. What’s surprising about them is that the error rate on the client-side of the request (frontend - hipstershop.AdService/GetAds) is much higher than the error rate on the server-side of the same request (adservice - hipstershop.AdService/GetAds). This means that the adservice is successfully completing the request, but something is happening when transferring the response back to the frontend service.

Before we move on to the next query example, let’s look at a different way to write the query that leverages additional SQL capabilities. To calculate the error rate in the query above, we are using twice the functions required to count spans with error and total spans. This can lead to inconsistencies if we update the query to change the way we calculate the error rate but don’t apply those changes everywhere. To avoid that, we can use SQL subqueries:

  TO_CHAR(100*(CAST(spans_with_error AS float)) / total_spans, '999D999%') as error_rate
        span_name as operation,
        COUNT(*) FILTER (WHERE status_code = 'STATUS_CODE_ERROR') as spans_with_error,
        COUNT(*) as total_spans
    FROM span
        start_time > NOW() - INTERVAL '1 hour'
    GROUP BY service_name, operation
) AS error_summary
ORDER BY error_rate DESC

This query first builds a dataset (error_summary) with service name, operation, spans with error, and total spans, and then it uses the values calculated in that dataset to compute the error rate avoiding the duplication in the initial query. This change doesn’t impact the performance of the query.

An additional consideration is that the query searches across all spans, and depending on the environment and the amount of instrumentation, you could see some duplicative results because of parent-child relationships between spans. In those cases, it could be better to start by looking at traces (complete request) that had an error. You can do that by only searching across root spans by adding an additional condition to the where clause:

WHERE start_time > NOW() - INTERVAL '1 hour' AND parent_span_id = 0

Query 2: “List the top slowest operations in the last hour”

At the beginning of the previous example, we identified response time and error rate as two key indicators of the experience we deliver to our users. Let’s see now how we can quickly determine bottlenecks in our services.

All spans contain a duration attribute that indicates how long it took to execute that span. To analyze the duration, we will look at several statistics and in particular percentiles.

In this example, our goal is to write a query that will return one row per individual operation with several percentiles (99.9th, 99th, 95th) and the average of the duration in milliseconds. For this one, we will search across root spans, so we see which user requests have the worse response time:

    span_name as operation,
    ROUND(approx_percentile(0.999, percentile_agg(duration_ms))::numeric, 3) as duration_p999,
    ROUND(approx_percentile(0.99, percentile_agg(duration_ms))::numeric, 3) as duration_p99,
    ROUND(approx_percentile(0.95, percentile_agg(duration_ms))::numeric, 3) as duration_p95,
    ROUND(avg(duration_ms)::numeric, 3) as duration_avg
FROM span
    start_time > NOW() - INTERVAL '1 hour' AND
    parent_span_id = 0
GROUP BY service_name, operation
ORDER BY duration_p99 DESC

This query uses TimescaleDB’s approx_percentile hyperfunction to calculate the different percentiles. We also use the ROUND function to limit the number of decimals shown in each column, plus we need to use ::numeric to convert the return value of approx_percentile, which is double-precision, to a type supported by the ROUND function, numeric in this case.

 service_name |              operation              | duration_p99 | duration_p999 | duration_p95 | duration_avg
 frontend     | /cart/checkout                      |    20658.359 |     20658.359 |     1238.359 |      600.609
 adservice    | AdService.start                     |    13319.743 |     13319.743 |    13319.743 |    13307.008
 cart         | grpc.health.v1.Health/Check         |     1843.628 |      1843.628 |      911.864 |      132.689
 frontend     | /                                   |     1159.651 |      1159.651 |      673.087 |      286.100
 frontend     | /product/{id}                       |      752.567 |       752.567 |      307.197 |      118.877
 frontend     | /cart                               |      582.594 |       582.594 |      270.289 |       92.472
 checkout     | hipstershop.CurrencyService/Convert |      193.229 |       193.229 |      101.888 |       30.177
 frontend     | /setCurrency                        |        7.504 |         7.504 |        0.908 |        0.498
 payment      | grpc.grpc.health.v1.Health/Check    |        1.615 |         1.615 |        0.172 |        0.386
 currency     | grpc.grpc.health.v1.Health/Check    |        1.355 |         1.355 |        0.140 |        0.420

If we look at the results, we can for example quickly see that there seems to be an issue with the /cart/checkout endpoint. Note that if we just looked at the averag, we would think performance is good (600 ms). Even if we look at the 95% percentile performance, it still looks acceptable at 1.2 seconds. However, when we look at the 99% percentile, we can see the performance is extremely poor (20 seconds!). This means that between 1% and 5% of the requests delivered a very poor user experience. This is definitely something worth investigating further.

As we can see, the results of the query are showing the performance of automated health checks. That may be useful in some scenarios, but it’s not something our users experience and we may want to filter them out. To do it, we would just add an additional condition to the where clause:

start_time > NOW() - INTERVAL '1 hour' AND
parent_span_id = 0 AND
span_name NOT LIKE '%health%'

Query 3: “Identify what services generate more load on other services by operation in the last 30 minutes”

In microservice architectures, there are many internal calls between services. One of those microservices could be going through a lot of load; that may be a service that is used by several other services for performing different operations, and we would not immediately know what’s causing that.

Using the query below, we can quickly list all dependencies across services, and get an understanding of not only who is calling who and how often - but also what are the specific operations involved in those calls, and how long the execution of those calls is taking in aggregate (in seconds). This serves as an additional indicator of load, since some types of requests could be much more expensive than others.

    client_span.service_name AS client_service,
    server_span.service_name AS server_service,
    server_span.span_name AS server_operation,
    count(*) AS number_of_requests,
    ROUND(sum(server_span.duration_ms)::numeric) AS total_exec_time
    span AS server_span
    JOIN span AS client_span
    ON server_span.parent_span_id = client_span.span_id
    client_span.start_time > NOW() - INTERVAL '30 minutes' AND
    client_span.service_name != server_span.service_name
    number_of_requests DESC;

In this query, we are leveraging another powerful capability of SQL: joins. We are joining the span table with itself to identify only the spans that represent a call between two services. This is what the condition client_span.service_name != server_span.service_name accomplishes.

The result would look something like the following:

client_service | server_service |                    server_operation                     | number_of_requests | total_exec_time
 frontend       | adservice      | hipstershop.AdService/GetAds                            |               2672 |               1
 frontend       | cart           | hipstershop.CartService/AddItem                         |                509 |               5
 checkout       | cart           | hipstershop.CartService/EmptyCart                       |                174 |               2
 frontend       | cart           | hipstershop.CartService/GetCart                         |               3697 |              25
 checkout       | cart           | hipstershop.CartService/GetCart                         |                174 |               2
 frontend       | checkout       | hipstershop.CheckoutService/PlaceOrder                  |                174 |              57
 frontend       | currency       | grpc.hipstershop.CurrencyService/Convert                |               8635 |              14
 checkout       | currency       | grpc.hipstershop.CurrencyService/Convert                |                408 |               1
 frontend       | currency       | grpc.hipstershop.CurrencyService/GetSupportedCurrencies |               3876 |               3
 checkout       | email          | /hipstershop.EmailService/SendOrderConfirmation         |                174 |               0
 checkout       | payment        | grpc.hipstershop.PaymentService/Charge                  |                174 |               0
 frontend       | productcatalog | hipstershop.ProductCatalogService/GetProduct            |              20436 |               1
 checkout       | productcatalog | hipstershop.ProductCatalogService/GetProduct            |                234 |               0
 frontend       | productcatalog | hipstershop.ProductCatalogService/ListProducts          |                501 |               0
 frontend       | recommendation | /hipstershop.RecommendationService/ListRecommendations  |               3374 |              15
 frontend       | shipping       | hipstershop.ShippingService/GetQuote                    |               1027 |               0
 checkout       | shipping       | hipstershop.ShippingService/GetQuote                    |                174 |               0
 checkout       | shipping       | hipstershop.ShippingService/ShipOrder                   |                174 |               0

These results show the client service making the request in the first column, and the service and operation receiving the request in the second and third columns. We see, for example, that the GetProduct method of the ProductCatalogService has been called from the frontend service, and the checkout service and that the former made many more calls in the last 30 minutes - which is expected and not an issue in this case. But if we saw a much higher percentage of calls to the GetProduct method originating from the checkout service, this would be an indicator of something unexpected going on.

This information was originally published in this blogpost. Check it out for further insights!