Overview
PostgreSQL Extensions: amcheck
PostgreSQL Extensions: Unlocking Multidimensional Points With Cube
PostgreSQL Extensions: hstore
PostgreSQL Extensions: ltree
PostgreSQL Extensions: Secure Your Time-Series Data With pgcrypto
PostgreSQL Extensions: pg_prewarm
PostgreSQL Extensions: pgRouting
PostgreSQL Extensions: pg_stat_statements
PostgreSQL Extensions: Database Testing With pgTAP
PostgreSQL Extensions: Install pg_trgm for Data Matching
PostgreSQL Extensions: Turning PostgreSQL Into a Vector Database With pgvector
PostgreSQL Extensions: PL/pgSQL
PostgreSQL Extensions: Using PostGIS and Timescale for Advanced Geospatial Insights
Beyond the Basics: Exploring PostgreSQL Extensions
PostgreSQL Extensions: Intro to uuid-ossp
Written by Matvey Arye and Avthar Sewrathan
pgvector
is a PostgreSQL extension that provides powerful functionalities for working with vectors in high-dimensional space. It introduces a dedicated data type, operators, and functions that enable efficient storage, manipulation, and analysis of vector data directly within the PostgreSQL database.
Timescale started supporting the extension in May 2023, and we're already doing pretty cool stuff with it. Using it will allow you to query and analyze vector data alongside other structured data stored in Timescale. If you're looking for a vector database, know that PostgreSQL is all you need.
Timescale Vector includes the pgvector extension, as well as a new high-performance index type, the Timescale Vector index, that works with pgvector to speed up queries over very large vector datasets. Learn more about Timescale Vector.
Let’s walk through the main features and use cases for the pgvector PostgreSQL extension. We’ll cover a few examples of using pgvector to do analysis with relational and time-series data in TimescaleDB too.
The pgvector extension is a powerful tool for storing, modifying, and querying vectors. This functionality enables various applications such as similarity and semantic search, retrieval augmented generation, image search, recommendation systems, natural language processing (NLP), and computer vision.
Here are some of the key features and use cases of pgvector:
Vector storage: The pgvector extension lets you store high-dimensional vectors directly in PostgreSQL tables. It provides a dedicated data type for vector representation, allowing efficient storage and retrieval of vector data.
Similarity search: With pgvector, you can perform similarity searches based on vector similarity metrics such as cosine similarity or Euclidean distance. Searching for similar vectors facilitates applications like content-based recommendation systems, k nearest-neighbor search, and clustering.
Semantic search: large language model (LLM) embeddings are a way to create vectors from other types of data (e.g., text, images, etc). These embeddings represent the meaning of the underlying data. Therefore, using similarity search to find similar vectors returns data with
Natural Language Processing (NLP) and text analysis: Vector embeddings are powerful as they capture the underlying meaning of text. Techniques like word or document embeddings enable you to capture the ideas expressed in the text. You can then perform vector-based operations like similarity search, clustering, or classification on textual data.
Computer vision: The pgvector extension can handle vector representations of images and enable similarity-based image search. You can convert images to vector representations using techniques like convolutional neural networks (CNN) or image embeddings. This allows you to perform content-based image retrieval, image similarity matching, object identification, and image clustering within the database.
Integration with SQL queries: pgvector seamlessly integrates with SQL queries. This allows you to combine vector similarity search with other filtering or aggregation operations, enabling more complex data analysis in a myriad of ways. For instance, you can easily join vector data to other structured data in a single SQL query rather than in an application-level join.
By utilizing the pgvector extension, developers can efficiently store and query vector data within PostgreSQL, making it easier to build generative AI applications with LLMs, as well as AI applications that require similarity search, recommendation systems, NLP, and other tasks that involve working with vectors.
Installation: You can use pgvector on a cloud database service in Timescale or a self-hosted PostgreSQL instance:
Timescale’s cloud database instances come with pgvector pre-installed. So, all you have to do is run CREATE EXTENSION
or connect to the PostgreSQL service with Timescale Vector’s Python client library. Get started with pgvector on Timescale. It’s free for 90 days and billed based on your usage thereafter.
For self-hosting, you need to install the pgvector extension on your server. You can follow the pgvector installation instructions for Linux, MacOS, and Windows. Once installed, you can enable the extension in your database using the CREATE EXTENSION
command:
CREATE EXTENSION vector;
Vector data type: pgvector introduces a new data type called vector
representing a high-dimensional vector. You can define vector-type
columns in your database tables to store vector data. For example, you can have a table called documents
with an embedding
column of type vector
to store LLM embeddings as vectors for each document.
CREATE TABLE documents (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
content TEXT,
author_id BIGINT,
embedding VECTOR(1538)
);
Indexing and vector search: pgvector provides indexing mechanisms optimized for approximate nearest neighbor search over vector data. You can create an index on a vector column using the CREATE INDEX
command, specifying the hnsw
index type and a vector_cosine_ops
operator class. This enables fast similarity searches on vector data using the cosine distance similarity metric. You can then use the <=>
operator to perform similarity searches in your queries:
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
--find the closest 5 elements to a given query
SELECT * FROM documents ORDER BY embedding <=> '[10.5, 11.0,...]' LIMIT 5;.
Vector functions: pgvector comes with a set of built-in functions to manipulate and perform operations on vector data. These functions allow you to calculate vector similarities, perform vector arithmetic, and more. For example, you can use the cosine_distance
function to calculate the cosine similarity between two vectors. Other useful functions are vector_norm()
to get the Euclidean norm and vector_dims()
to determine how many dimensions a vector contains.
Vector aggregates: pgvector has the avg(vector)
and sum(vector)
aggregate functions for calculating analytics on vectors.
Integration with other PostgreSQL features: pgvector seamlessly integrates with other features of PostgreSQL, such as transaction management, query optimization, and security. This enables you to leverage the power of PostgreSQL for complex data processing tasks. Joining pgvector data with other data types, like relational, time-series, and geospatial data, is an especially powerful way to enrich your vector queries’ data. As a simple example, you could return author information for documents found via similarity search:
WITH matching_docs as (
--find 5 closest matches
SELECT *
FROM documents
ORDER BY embedding <=> '[10.5, 11.0,...]'
LIMIT 5
)
SELECT d.content, a.first_name, a.last_name
FROM matching_docs d
INNER JOIN author a ON (a.id = d.author_id).
Timescale is PostgreSQL++ engineered for time-series data, events, and analytics. It provides a robust foundation for storing, retrieving, and analyzing large volumes of time-series data thanks to its time-based aggregations, data retention policies, and features like hypertables and continuous aggregates.
Many real-world AI applications have retrieval or analytical requirements that include both vector data and a temporal aspect to similarity searches. Some examples of such workloads where time is important are embeddings of news, legal documents, and financial statements.
In such cases, users often want to find documents that are both similar and within a specific time frame. Another common use is time-weighing, where users may not use a strict time filter but still prefer more recent or older data. For such workloads, Timescale can markedly enhance both data ingestion and query speeds, thanks to hypertables—automatic time-based partitioning of tables. Timescale is especially good at optimizing queries that filter data based on time, enabling those queries to run more efficiently.
Furthermore, complex systems use many different types of data workloads like time series, event, historical, real-time, and structured relational data. Storing these different types of data in one database has many operational advantages. It simplifies operations management by bringing together tasks like backups, recovery, and security into one system, reducing complexity.
Additionally, combining different data types using joins enables more sophisticated data analysis. By keeping everything in the same database, you can avoid data silos, speed up your workflows and drive faster insights for your users.
Step 1. Set up pgvector
and timescaledb
on a PostgreSQL instance. You can do this either with a cloud-hosted database on Timescale (recommended) or a self-hosted instance.
Timescale: The easiest way to do this is with a database instance managed by Timescale, which comes with pgvector and timescaleDB already pre-installed. Get started for free with Timescale Cloud for 90 days. The only thing you need to do is execute
CREATE EXTENSION vector;
Self-hosted: To install the TimescaleDB extension in your PostgreSQL database, you can follow the installation guide provided by TimescaleDB based on your PostgreSQL version. To install pgvector, you can follow the installation instructions for Linux, MacOS, and Windows.
You then need to execute the following:
CREATE EXTENSION vector;
CREATE EXTENSION timescaledb;
Step 2. Create a Timescale hypertable with a vector column: Create a regular table in your database that will serve as the basis for the hypertable. Convert the table into a hypertable using the create_hypertable
function provided by TimescaleDB. Specify the time column and other relevant options.
CREATE TABLE documents (
id BIGINT GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ,
content TEXT,
embedding VECTOR(1538)
);
SELECT create_hypertable('documents', 'created_at');
Step 3. Insert vector data into the hypertable: Use regular SQL INSERT
statements to insert data into the hypertable, including the vector data in the designated vector column. Ensure that the vector data is in the correct format expected by pgvector. You will need to get the vector representation from an embedding model (e.g., OpenAI text-embedding-3 models or, if you prefer, open-source sentence transformers).
INSERT INTO documents (created_at, content,embedding)
VALUES ('2023-06-01 00:00:00', 'the quick', '[1,2,3]'),
('2023-06-02 00:00:00', 'brown fox', '[4,5,6]'),
('2023-06-03 00:00:00', 'jumped over' '[7,8,9]');
Step 4. Querying and analyzing vector data. Use SQL queries to perform various operations on the vector data stored in the hypertable. You can combine TimescaleDB's time-series functions with pgvector's vector functions for analysis and querying.
Examples of queries you can perform include:
Similarity search: Find vectors similar to a given query vector using the <=>
operator provided by pgvector. Combine that with filters on other columns, such as time.
Aggregation and grouping: Use TimescaleDB's time-series aggregation functions to aggregate vector data over time intervals or other dimensions.
Filtering and selection: Use regular SQL filters and conditions to select specific vector data based on certain criteria.
-- Similarity search
SELECT *
FROM documents
ORDER BY data <=> '[1,2,3]'::vector
LIMIT 5;
-- Similarity search filtered by time
SELECT *
FROM documents
WHERE time >= '2023-06-02 00:00:00' AND time < '2023-06-03 00:00:00'
ORDER BY data <=> '[1,2,3]'::vector
LIMIT 5;
-- Aggregation and grouping,
SELECT time_bucket('30 day', created_at) AS day, length(data) AS avg_data_length
FROM sensor_data
GROUP BY day
ORDER BY day;
-- Filtering and selection
SELECT *
FROM sensor_data
WHERE time >= '2023-06-02 00:00:00' AND time < '2023-06-03 00:00:00';
Step 6. Optimization and performance. Depending on the scale and complexity of your data, consider optimizing the performance of your queries by creating indexes on the vector
column using the CREATE INDEX
command with the hnsw
index type and the vector_cosine_ops
operator class. Tune the configuration parameters of TimescaleDB and PostgreSQL based on your workload and resource requirements to achieve optimal performance.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
Want to learn more about pgvector and how to use it? Read the following resources:
Explore our documentation about working with pgvector for your AI application
Use the Timescale Vector Python library to make it easy to work with vector data
How to build with pgvector: PostgreSQL as a Vector Database: Create, Store, and Query OpenAI Embeddings With pgvector
Focus on performance: How we made PostgreSQL a better vector database
How to Build LLM Applications With pgvector Vector Store in LangChain
Get started with pgvector on a mature, production-ready cloud PostgreSQL platform: sign up for Timescale today (and get 90 days free).