Devopsdays NYC 2020 Demo, Open Space Recap & More
We recently attended the NYC installment of the devopsdays event series (thank you to the local organizers and volunteers!), where we met with community members interested in all things monitoring, infrastructure, software development, and CI/CD.
Given the cancellation of many industry events to ensure public safety and mitigate COVID-19’s spread (check out our blog post if you’re interested in monitoring it yourself), we’re sharing a bit about our recent experience—what we learned, what we demoed, and what we spoke about—to bring the event experience to the wider community.
During the event, I demoed how to use TimescaleDB as a long-term store for Prometheus metrics, combining Prometheus, TimescaleDB, and Grafana to monitor a piece of critical infrastructure (in this case, a database). This sort of create-your-own flexibility and customization is becoming more and more common in my conversations with developers, and this demo allows you to create a monitoring stack that suits your needs without adding significant costs.
Why this scenario? I was inspired by one of our customers, who uses TimescaleDB to store and analyze their Prometheus metrics. They told us how it not only saves them money and disk space but also allows them to keep their data around and see trends over longer periods.
See the demo in action
You’ll notice a Grafana dashboard visualizing metrics, with TimescaleDB as the data source powering the dashboard. I focused on the basic monitoring metrics below, but if you try it yourself, you can add metrics that give you more insight (e.g., query latency, queries per second, open locks, or cache hits):
- CPU usage
- Service status
- Percentage of disk used
- Number of database connections
- Percentage of memory used
- Network status
To replicate the demo, follow these tutorials on how to store Prometheus metrics in Timescale and how to use Timescale as a data source to power Grafana dashboards.
Open Space: DevOps & Data
Devopsdays “Open Spaces” are a (wonderful) concept similar to an unconference format: there’s a block of time scheduled for any attendees to discuss topics of their choosing with other interested attendees. Simply propose a topic to the audience that you’d like to discuss for 30 minutes, and other attendees can pick and choose which sessions they’d like to attend.
Fellow Timescaler Matvey Arye and I hosted an Open Space session about DevOps data; other sessions ranged from negotiating pay and other soft skills to DevOps in small companies and DevOps in specific ecosystems (AWS, Microsoft Azure, Google Cloud, etc.).
In our session, we heard stories and best practices, and learned how developers from all industries and areas think about the DevOps data they collect.
A few highlights and commonalities
Teams are moving away from managing infrastructure themselves and toward managed services (as one person put it: “One of the key criteria, when we select a new tool, is that we want one less thing to manage”).
Data is becoming increasingly central in how teams fuel their post-mortem problem analysis. Developers collect data about critical incidents, search for patterns in what’s causing them, and correlate this information with how it impacts clients or users.
One team’s best practice and advice (they manage a massive consumer messaging app): Take snapshots of high load periods. This way, you get more detailed information to use for planning and to calibrate for the following years. In this team’s case, the New Year’s Eve timeframe is when they see the highest number of messages sent across their global user base.
Kubernetes, as always, was a hot topic. Two common pain points stood out (and are things that we can relate to as we build our Kubernetes deployment and multi-node offerings):
- #1: Visibility of what’s happening inside clusters and pods. Someone summed it up with, “I don’t just want to know my pod is offline, I want to know what was going on inside it.” We couldn’t agree more.
- #2: Aggregating observability data across clusters to simplify things for Ops teams that handle metrics from multiple application teams.
Questions & Conversations
To me, the best part of any conference is the hallway conversations and hearing the things community members are keen to learn. As a company, we’re help-first, so in the spirit of helping, here are a few questions I heard again and again that may be relevant as you get up and running or do more advanced things with TimescaleDB:
How does TimescaleDB perform at scale?
TimescaleDB scales up well. In our internal benchmarks on standard cloud VMs, we regularly test TimescaleDB to 10+ billion rows while sustaining insert rates of 100-200k rows per second (1-2 million metric inserts / second).
While running on more powerful hardware, we’ve seen users scale a single-node setup to 500 billion rows of data while sustaining 400k row inserts per second. To learn more about how TimescaleDB is architected to achieve this scale, see this blog explainer.
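To make those figures concrete, here is a minimal back-of-the-envelope sketch. It assumes each row carries 10 metric values, which is only inferred from the quoted ratio of rows to metric inserts, and the function names are illustrative rather than part of any TimescaleDB tooling.

```python
# Back-of-the-envelope arithmetic for the benchmark figures quoted above.
# ASSUMPTION: 10 metric values per row, inferred from the quoted ratio
# (100-200k rows/s corresponding to 1-2 million metric inserts/s).
METRICS_PER_ROW = 10


def metric_inserts_per_second(rows_per_second: int,
                              metrics_per_row: int = METRICS_PER_ROW) -> int:
    """Convert a row insert rate into a metric insert rate."""
    return rows_per_second * metrics_per_row


def hours_to_reach(total_rows: int, rows_per_second: int) -> float:
    """Hours of sustained ingest needed to accumulate total_rows."""
    return total_rows / rows_per_second / 3600


if __name__ == "__main__":
    for rate in (100_000, 200_000):
        print(
            f"{rate:,} rows/s -> {metric_inserts_per_second(rate):,} metrics/s, "
            f"{hours_to_reach(10_000_000_000, rate):.1f} h to reach 10 billion rows"
        )
```

At the lower rate, sustaining ingest for a little over a day is enough to reach the 10 billion row mark, which gives a feel for how quickly a busy metrics pipeline accumulates data.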
What’s the role of a long-term data store? What types of things does this allow me to do?
To keep Prometheus simple and easy to operate, its creators intentionally left out some of the scaling features developers typically need. Prometheus stores data locally within the instance and does not replicate it. While having both compute and data storage on one node makes it easier to operate, it also makes it harder to scale and ensure high availability.
More specifically, this means Prometheus data isn’t arbitrarily scalable or durable in the face of disk or node outages.
Simply put, Prometheus isn’t designed to be a long-term metrics store. However, its creators also made Prometheus extremely extensible, and thus, you can use TimescaleDB to store metrics for longer periods of time, which helps with capacity planning and system calibration.
This combination also enables high availability and provides advanced capabilities, such as full SQL, joins, and replication (things not available in Prometheus). To learn more, see why use TimescaleDB and Prometheus.
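As a rough illustration of what full SQL over long-term metrics can look like, the sketch below builds a parameterized query that averages one metric per day over the past year using TimescaleDB's time_bucket function. The `metrics` relation and its `time`, `name`, and `value` columns are assumptions about how your Prometheus data is laid out, and the metric name in the usage note is hypothetical; adjust both to match your schema.

```python
from typing import Any, Dict, Tuple


def long_term_average_query(metric_name: str,
                            bucket: str = "1 day",
                            lookback: str = "1 year") -> Tuple[str, Dict[str, Any]]:
    """Build a query that averages one metric per time bucket.

    ASSUMPTION: samples live in a relation called `metrics` with
    `time`, `name`, and `value` columns. Rename to fit your schema.
    """
    sql = (
        "SELECT time_bucket(%(bucket)s::interval, time) AS period,\n"
        "       avg(value) AS avg_value\n"
        "FROM metrics\n"
        "WHERE name = %(name)s\n"
        "  AND time > now() - %(lookback)s::interval\n"
        "GROUP BY period\n"
        "ORDER BY period"
    )
    params = {"bucket": bucket, "name": metric_name, "lookback": lookback}
    return sql, params
```

The placeholders use the named-parameter style accepted by Postgres drivers such as psycopg2, so the pair can be passed straight to `cursor.execute(sql, params)`, for example with `long_term_average_query("cpu_usage")`.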
How do I use TimescaleDB and Prometheus? Do I have to use any special connectors?
Check out the demo :). I suggest using TimescaleDB as a remote read and write endpoint for Prometheus metrics, whether they come from infrastructure for an internal system or from your public-facing eCommerce website. Since TimescaleDB extends Postgres, you install the pg_prometheus extension for Postgres and our prometheus_postgresql_adapter, and you’re ready to get started.
Whatever works with Postgres works with TimescaleDB, so if you want to connect to viz tools (like Grafana or Tableau), ingest data from places like Kafka, or insert and analyze data using your favorite programming language (like Python or Go), just use one of the many connectors and libraries in the Postgres ecosystem.
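For a sense of what the remote read and write wiring involves, here is a small sketch that renders the `remote_write` and `remote_read` section of a Prometheus configuration file. The host, the port 9201, and the `/write` and `/read` paths are assumptions about how the adapter is exposed in your environment, so confirm them against the adapter's documentation before use; the helper function itself is hypothetical.

```python
def remote_storage_config(adapter_host: str = "localhost",
                          adapter_port: int = 9201) -> str:
    """Render the remote storage section of prometheus.yml.

    ASSUMPTION: the adapter accepts writes at /write and reads at /read
    on the given host and port. Verify against your adapter's docs.
    """
    base = f"http://{adapter_host}:{adapter_port}"
    return (
        "remote_write:\n"
        f'  - url: "{base}/write"\n'
        "remote_read:\n"
        f'  - url: "{base}/read"\n'
    )


if __name__ == "__main__":
    # Paste the output into prometheus.yml alongside your scrape configs.
    print(remote_storage_config())
```

With this in place, Prometheus keeps scraping as usual while forwarding samples to the adapter, and queries can reach back into the data held in TimescaleDB.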
Thank you again to the devopsdays NYC team for your work in pulling off such an interactive, fun, and community-first event! We’ll definitely be attending as future events are announced (virtually or otherwise).
In the meantime, those resources once more:
- Demo Video
- Tutorials: Prometheus, Grafana
- How to Analyze Your Prometheus Data in SQL: 3 Queries You Need to Know