Timescale Logo

PostgreSQL Extensions: Turning PostgreSQL Into a Vector Database With pgvector

Written by Matvey Arye and Avthar Sewrathan

pgvector is a PostgreSQL extension that provides powerful functionalities for working with vectors in high-dimensional space. It introduces a dedicated data type, operators, and functions that enable efficient storage, manipulation, and analysis of vector data directly within the PostgreSQL database.

Timescale started supporting the extension in May 2023, and we're already doing pretty cool stuff with it. Using it will allow you to query and analyze vector data alongside other structured data stored in Timescale. If you're looking for a vector database, know that PostgreSQL is all you need. 

Timescale Vector includes the pgvector extension, as well as a new high-performance index type, the Timescale Vector index, that works with pgvector to speed up queries over very large vector datasets. Learn more about Timescale Vector.

Let’s walk through the main features and use cases for the pgvector PostgreSQL extension. We’ll cover a few examples of using pgvector to do analysis with relational and time-series data in TimescaleDB too.

Understanding pgvector: Enabling Vector Operations

The pgvector extension is a powerful tool for storing, modifying, and querying vectors. This functionality enables various applications such as similarity and semantic search, retrieval augmented generation, image search, recommendation systems, natural language processing (NLP), and computer vision.

Here are some of the key features and use cases of pgvector:

  • Vector storage: The pgvector extension lets you store high-dimensional vectors directly in PostgreSQL tables. It provides a dedicated data type for vector representation, allowing efficient storage and retrieval of vector data.

  • Similarity search: With pgvector, you can perform similarity searches based on vector similarity metrics such as cosine similarity or Euclidean distance. Searching for similar vectors facilitates applications like content-based recommendation systems, k nearest-neighbor search, and clustering.

  • Semantic search: large language model (LLM) embeddings are a way to create vectors from other types of data (e.g., text, images, etc). These embeddings represent the meaning of the underlying data. Therefore, using similarity search to find similar vectors returns data with

  • Natural Language Processing (NLP) and text analysis: Vector embeddings are powerful as they capture the underlying meaning of text. Techniques like word or document embeddings enable you to capture the ideas expressed in the text. You can then perform vector-based operations like similarity search, clustering, or classification on textual data.

  • Computer vision: The pgvector extension can handle vector representations of images and enable similarity-based image search. You can convert images to vector representations using techniques like convolutional neural networks (CNN) or image embeddings. This allows you to perform content-based image retrieval, image similarity matching, object identification, and image clustering within the database.

  • Integration with SQL queries: pgvector seamlessly integrates with SQL queries. This allows you to combine vector similarity search with other filtering or aggregation operations, enabling more complex data analysis in a myriad of ways. For instance, you can easily join vector data to other structured data in a single SQL query rather than in an application-level join.

Using pgvector for Vector Data

By utilizing the pgvector extension, developers can efficiently store and query vector data within PostgreSQL, making it easier to build generative AI applications with LLMs, as well as AI applications that require similarity search, recommendation systems, NLP, and other tasks that involve working with vectors.

  • Installation: You can use pgvector on a cloud database service in Timescale or a self-hosted PostgreSQL instance:

  • Vector data type: pgvector introduces a new data type called vector representing a high-dimensional vector. You can define vector-type columns in your database tables to store vector data. For example, you can have a table called documents with an embedding column of type vector to store LLM embeddings as vectors for each document.

    CREATE TABLE documents (
     id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
     content TEXT,
     author_id BIGINT,
     embedding VECTOR(1538)
    );
    
    

  • Indexing and vector search: pgvector provides indexing mechanisms optimized for approximate nearest neighbor search over vector data. You can create an index on a vector column using the CREATE INDEX command, specifying the hnsw index type and a vector_cosine_ops operator class. This enables fast similarity searches on vector data using the cosine distance similarity metric. You can then use the <=> operator to perform similarity searches in your queries:

    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
    
    --find the closest 5 elements to a given query
    SELECT * FROM documents ORDER BY embedding <=> '[10.5, 11.0,...]' LIMIT 5;.
    
    

  • Vector functions: pgvector comes with a set of built-in functions to manipulate and perform operations on vector data. These functions allow you to calculate vector similarities, perform vector arithmetic, and more. For example, you can use the cosine_distance function to calculate the cosine similarity between two vectors.  Other useful functions are vector_norm() to get the Euclidean norm and vector_dims() to determine how many dimensions a vector contains.

  • Vector aggregates: pgvector has the avg(vector) and sum(vector) aggregate functions for calculating analytics on vectors.

  • Integration with other PostgreSQL features: pgvector seamlessly integrates with other features of PostgreSQL, such as transaction management, query optimization, and security. This enables you to leverage the power of PostgreSQL for complex data processing tasks. Joining pgvector data with other data types, like relational, time-series, and geospatial data, is an especially powerful way to enrich your vector queries’ data. As a simple example, you could return author information for documents found via similarity search:

WITH matching_docs as (
  --find 5 closest matches 
  SELECT * 
  FROM documents 
  ORDER BY embedding <=> '[10.5, 11.0,...]' 
  LIMIT 5
)
SELECT d.content, a.first_name, a.last_name
FROM matching_docs d 
INNER JOIN author a ON (a.id = d.author_id).

Why Use pgvector With Timescale?

Timescale is PostgreSQL++ engineered for time-series data, events, and analytics. It provides a robust foundation for storing, retrieving, and analyzing large volumes of time-series data thanks to its time-based aggregations, data retention policies, and features like hypertables and continuous aggregates.

Many real-world AI applications have retrieval or analytical requirements that include both vector data and a temporal aspect to similarity searches. Some examples of such workloads where time is important are embeddings of news, legal documents, and financial statements.

In such cases, users often want to find documents that are both similar and within a specific time frame. Another common use is time-weighing, where users may not use a strict time filter but still prefer more recent or older data. For such workloads, Timescale can markedly enhance both data ingestion and query speeds, thanks to hypertables—automatic time-based partitioning of tables. Timescale is especially good at optimizing queries that filter data based on time, enabling those queries to run more efficiently. 

Furthermore, complex systems use many different types of data workloads like time series, event, historical, real-time, and structured relational data. Storing these different types of data in one database has many operational advantages. It simplifies operations management by bringing together tasks like backups, recovery, and security into one system, reducing complexity.

Additionally, combining different data types using joins enables more sophisticated data analysis. By keeping everything in the same database, you can avoid data silos, speed up your workflows and drive faster insights for your users.

How to Use pgvector With Timescale

Step 1. Set up pgvector and timescaledb on a PostgreSQL instance. You can do this either with a cloud-hosted database on Timescale (recommended) or a self-hosted instance.

You then need to execute the following:

CREATE EXTENSION vector;
CREATE EXTENSION timescaledb;

Step 2. Create a Timescale hypertable with a vector column: Create a regular table in your database that will serve as the basis for the hypertable. Convert the table into a hypertable using the create_hypertable function provided by TimescaleDB. Specify the time column and other relevant options. 

CREATE TABLE documents (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  created_at TIMESTAMPTZ,
  content TEXT,
  embedding VECTOR(1538)
);

SELECT create_hypertable('documents', 'created_at');

Step 3. Insert vector data into the hypertable: Use regular SQL INSERT statements to insert data into the hypertable, including the vector data in the designated vector column. Ensure that the vector data is in the correct format expected by pgvector. You will need to get the vector representation from an embedding model (e.g., OpenAI text-embedding-3 models or, if you prefer, open-source sentence transformers). 

INSERT INTO documents (created_at, content,embedding)
VALUES ('2023-06-01 00:00:00', 'the quick', '[1,2,3]'),
       ('2023-06-02 00:00:00', 'brown fox', '[4,5,6]'),
       ('2023-06-03 00:00:00', 'jumped over' '[7,8,9]');

Step 4. Querying and analyzing vector data. Use SQL queries to perform various operations on the vector data stored in the hypertable. You can combine TimescaleDB's time-series functions with pgvector's vector functions for analysis and querying.

Examples of queries you can perform include:

  • Similarity search: Find vectors similar to a given query vector using the <=> operator provided by pgvector. Combine that with filters on other columns, such as time.

  • Aggregation and grouping: Use TimescaleDB's time-series aggregation functions to aggregate vector data over time intervals or other dimensions.

  • Filtering and selection: Use regular SQL filters and conditions to select specific vector data based on certain criteria.

-- Similarity search
SELECT *
FROM documents
ORDER BY data <=> '[1,2,3]'::vector
LIMIT 5;

-- Similarity search filtered by time
SELECT *
FROM documents
WHERE time >= '2023-06-02 00:00:00' AND time < '2023-06-03 00:00:00'
ORDER BY data <=> '[1,2,3]'::vector
LIMIT 5;

-- Aggregation and grouping,
SELECT time_bucket('30 day', created_at) AS day, length(data) AS avg_data_length
FROM sensor_data
GROUP BY day
ORDER BY day;

-- Filtering and selection
SELECT *
FROM sensor_data
WHERE time >= '2023-06-02 00:00:00' AND time < '2023-06-03 00:00:00';

Step 6. Optimization and performance. Depending on the scale and complexity of your data, consider optimizing the performance of your queries by creating indexes on the vector column using the CREATE INDEX command with the hnsw index type and the vector_cosine_ops operator class. Tune the configuration parameters of TimescaleDB and PostgreSQL based on your workload and resource requirements to achieve optimal performance.

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

Next Steps

Want to learn more about pgvector and how to use it? Read the following resources:

Get started with pgvector on a mature, production-ready cloud PostgreSQL platform: sign up for Timescale today (and get 90 days free).

Timescale Logo

Subscribe to the Timescale Newsletter

By submitting, I acknowledge Timescale’s Privacy Policy
2024 © Timescale Inc. All rights reserved.