[Recap, Resources & More] DataPub #1: a virtual meetup for public data enthusiasts
Our guest speakers covered everything from techniques for wrangling public data and ways to better handle geospatial images to how to write faster queries (so you have more time to play Animal Crossing 🐯).
We just hosted our first meeting of DataPub - a new monthly meetup where we invite community members and guest speakers to talk about all things open data. While we love the hallway conversations that come with in-person events, there’s something special about online-only events; they eliminate the cost, time, and other factors that come with paying for tickets and traveling to conferences, so more people are able to join, and, as a result, we get more diverse perspectives.
If you attended the first session, thank you! If you missed it, we’ll host a new installment every 3rd Tuesday (RSVP here for our April 21st event), and this is the first of many summary posts (knowledge and transparency are power, so we always post recaps from any event we host or attend).
We believe that the world is a better place when we are able to use data to get answers to important questions, be it about our work projects or climate change (read more about why we launched DataPub and how we’ve used public data to monitor COVID-19).
What’s Open Data and why does it matter?
Ajay Kulkarni, Timescale co-founder & CEO, kicked us off and set the stage with a brief intro, welcome, and thank you to everyone.
By definition, open data is freely available for people to use, re-use, and redistribute without any legal, technological, or social restriction; it gives us the ability to have full transparency into governments, health, commerce, and more (see more from the Open Knowledge Foundation’s Handbook).
In short, open data is awesome and powerful, and, given the positive response to our first session, it appears we’re not alone in this belief :). Our goal? Use DataPub as a forum to help community members everywhere share new and interesting ways to use open data, find great public datasets, and better interpret and share results.
Guest speaker lineup and session summary:
For DataPub #1, our speakers dialed in from New Jersey, Portland, and India, and our live attendees spanned the globe.
Speaker #1: Joel Natividad, datHere CEO - “Flattening the Curve”
Joel is a long-time open data advocate, and his talk focuses on a timely topic: his journey wrangling the data he needed to understand the COVID-19 pandemic as it evolves. This is personal for him, as his wife is a nurse on the frontlines and New Jersey, his home, is one of the epicenters of COVID-19 outbreaks.
Joel started with Johns Hopkins University’s global confirmed cases data, building a set of time-series data utilities to make it easier to query and analyze for his purposes (e.g., understanding how cases were changing and moving across geographies over time). Per the above, we actually used this data ourselves to map the spread of COVID-19 with Grafana.
But, as he dug in, he realized that he wasn’t able to get the county-by-county level of granularity he needed to analyze how outbreaks may affect him, his family, and his colleagues around the US.
Meanwhile, the New York Times published an article with case reports from 1K+ counties. Joel quickly discovered it and lobbied the NYT to release the data to the public.
Not one to wait around, Joel wrote a Selenium scraper to pull the data out of the New York Times’ article (NYT has since made the data publicly available). But that still wasn’t perfect, and his data wrangling journey continued, taking him to USAFacts.org’s county-level open data, which gave him additional intelligence – and ultimately, helped him see what’s happening in his corner of the world.
Throughout his session, Joel reiterates how, while public datasets are often unstable and frequently change, perseverance pays off, and he shares tons of advice for anyone looking to wrangle this data for themselves.
Speaker #2: Saheel Ahmed, Blue Sky Analytics Data Scientist - “Intro to Cloud Optimized GeoTIFFs”
From there, we shifted to Saheel, who focuses on a different, yet important, global issue: understanding and fighting climate change. In his work at Blue Sky Analytics, he uses public satellite imagery to monitor and analyze environmental risk factors, like air pollution, fires, and water levels.
Given the nature of satellite imagery, raw geospatial image files are notoriously large, making them hard to store and visualize (unless you like extremely slow loading dashboards). How can we get around this problem?
Enter TIFFs: TIFF is an image file format that supports compression, commonly used in satellite or medical imagery, and GeoTIFFs tell you where an image originated via a series of embedded coordinates. Saheel explains how, with Cloud Optimized GeoTIFFs (COGs) - developed with the Open Source Geospatial Foundation’s GDAL project - and HTTP GET range requests, we no longer need to load an entire file; we request, generate, and load just the parts of the file we need (using specific coordinates) for our queries and reports.
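To make the range-request idea concrete, here is a minimal Python sketch (not taken from Saheel's talk). It builds, but does not send, an HTTP GET that asks for a single slice of a remote file. The URL, the offset, and the length are made-up values for illustration; in a real COG, a reader would look up where each tile lives from the file's header before asking for those bytes.

```python
import urllib.request


def build_range_request(url: str, offset: int, length: int) -> urllib.request.Request:
    """Build (but do not send) a GET request for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last_byte = offset + length - 1  # the end of an HTTP byte range is inclusive
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})


def read_local_range(blob: bytes, offset: int, length: int) -> bytes:
    """Local stand-in for the slice a range-capable server would send back."""
    return blob[offset:offset + length]


# Hypothetical file location and tile position, used only for illustration.
request = build_range_request("https://example.com/scene_cog.tif", offset=4096, length=1024)
print(request.get_method(), request.get_header("Range"))

# Simulate the response with an in-memory "file" instead of a network call.
fake_file = bytes(range(256)) * 64  # 16 KiB of dummy data
tile = read_local_range(fake_file, offset=4096, length=1024)
print(len(tile), "bytes fetched out of", len(fake_file))
```

A server that supports range requests would answer a request like this with only the requested slice, which is why a dashboard can show one region of a very large image without downloading the whole thing.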
Saheel walks through how they’re building a data pipeline for this at Blue Sky Analytics, and how anyone can convert a regular GeoTIFF into a Cloud Optimized GeoTIFF to more easily work with publicly available government and organizational geospatial data.
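If you want to try the conversion yourself, one option (not necessarily the approach Saheel demonstrates) is GDAL's gdal_translate tool with its COG output driver, which we believe requires GDAL 3.1 or newer. The Python sketch below only assembles the command; the file names and the helper function are hypothetical, and the compression setting is just one reasonable choice.

```python
import shlex
from typing import List


def cog_translate_command(src: str, dst: str, compress: str = "DEFLATE") -> List[str]:
    """Assemble a gdal_translate call that writes `src` out as a Cloud Optimized GeoTIFF."""
    return [
        "gdal_translate",
        src,
        dst,
        "-of", "COG",                   # ask for the COG output driver
        "-co", f"COMPRESS={compress}",  # creation option: compression method
    ]


# Hypothetical input and output file names.
command = cog_translate_command("scene.tif", "scene_cog.tif")
print(shlex.join(command))

# To actually run the conversion (requires GDAL on your PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it easy to check what will be executed before pointing it at a large file.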
Speaker #3: Jonan Scheffler, Timescale Developer Advocate - “Faster Queries = More Animal Crossing”
Animal Crossing is taking the developer world by storm, but where do they get the free time? Jonan’s hypothesis: they certainly aren’t waiting for slow-running queries.
Jonan shows you 3 ways to write faster queries, including some counterintuitive bits in SQL that cause inefficient queries. To illustrate, he uses Wikimedia Foundation's EventStreams data, which captures all edits to Wikimedia properties in real-time (since this is inherently time-series data, he wrote a Python package to store the data in TimescaleDB).
To try it yourself, you can get Jonan’s demo on GitHub, including: a Docker Compose file, TimescaleDB instance, and a Grafana dashboard with initial visualizations.
He uses this dataset and demo to share common query “mistakes” that slow down query performance - like issues with common table expressions, explicit type casting (varchar) to take better advantage of native compression with TimescaleDB, and writing more specific filters, so you’re using your indexes.
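To illustrate the last point about specific filters, here is a small self-contained sketch. It is not Jonan's demo: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL/TimescaleDB, and the edits table, its columns, and the sample rows are all invented. The idea it shows is general, though: a filter on an indexed time column lets the planner use the index, while a filter on a non-indexed column forces a scan of the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema loosely inspired by a stream of wiki edits.
conn.execute("CREATE TABLE edits (ts TEXT, title TEXT, editor TEXT)")
conn.execute("CREATE INDEX idx_edits_ts ON edits (ts)")
conn.executemany(
    "INSERT INTO edits VALUES (?, ?, ?)",
    [
        ("2020-03-31T23:59:00", "Tom Nook", "alice"),
        ("2020-04-01T08:15:00", "Isabelle", "bob"),
        ("2020-04-01T09:30:00", "Open data", "alice"),
    ],
)


def query_plan(sql: str, params: tuple = ()) -> str:
    """Return SQLite's query plan for `sql` as a single line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # the last column holds the description


# Broad filter: nothing here touches the indexed column, so the table is scanned.
broad_plan = query_plan("SELECT title FROM edits WHERE editor = ?", ("alice",))

# More specific filter: the time range lets the planner use the index on ts.
specific_plan = query_plan(
    "SELECT title FROM edits WHERE editor = ? AND ts >= ? AND ts < ?",
    ("alice", "2020-04-01", "2020-04-02"),
)

print("broad:   ", broad_plan)
print("specific:", specific_plan)
```

In PostgreSQL or TimescaleDB you would run EXPLAIN (or EXPLAIN ANALYZE) on your own queries to see the same kind of difference; the exact plan text differs between databases and versions.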
After his talk, you’ll have a few new ways to speed up your queries, so you have more time to play Animal Crossing (or your internet game of choice).
Thank you to everyone involved, from our speakers (Joel, Saheel, and Jonan) to everyone who attended and the Timescale team members behind the scenes who helped make this happen.
We couldn’t have had a successful first virtual meetup without the support of the open data community – and we’re excited to host next month’s event!
- All registrants receive the recording, slides, and other resources shortly following the session, so register even if you’re unable to attend live.
If you have an open data project, dataset, or technology that you would like to share at a future DataPub, reach out to [email protected] and we’ll make it happen.