Vector Data

PostgreSQL Extensions: Turning PostgreSQL Into a Vector Database With pgvector

A visual representation of a vector, represented in a graph.

Written by Matvey Arye and Avthar Sewrathan

pgvector is a PostgreSQL extension that provides powerful functionalities for working with vectors in high-dimensional space. It introduces a dedicated data type, operators, and functions that enable efficient storage, manipulation, and analysis of vector data directly within the PostgreSQL database.

Timescale started supporting the extension in May 2023, and we're already doing pretty cool stuff with it. Using it will allow you to query and analyze vector data alongside other structured data stored in Timescale. If you're looking for a vector database, know that PostgreSQL is all you need. 

For AI and vector data, the Timescale Cloud platform includes the pgvector extension and two new open-source PostgreSQL extensions developed by the Timescale Team: pgai, which brings more AI workflows to PostgreSQL, making it easier for developers to build search and retrieval-augmented generation (RAG) applications; and pgvectorscale, which enables developers to build more scalable AI applications, with higher performance embedding search and cost-efficient storage. You can learn more about these two extensions in this article.

Let’s walk through the main features and use cases. We’ll also cover a few examples of analyzing relational and time-series data in TimescaleDB.

What are vectors?

Vectors refer to mathematical representations of data points in multidimensional space. They are typically arrays or lists of numerical values that capture an entity's essential features or attributes. In machine learning and data science, vectors encode information in a way that enables efficient computation and analysis.

Vectors can represent words, sentences, or documents, allowing for semantic search where similar meanings are identified even if the exact words differ. This can be used to find documents with similar content or answer queries based on the context of the text. By converting visual data into vectors, vector databases (such as PostgreSQL with the pgvector extension) enable searching for similar images or videos. This can be useful in applications like facial recognition, object detection, and content-based image retrieval.

In time series and other types of data, vectors can represent normal behavior patterns, allowing the detection of anomalies by identifying vectors that significantly deviate from the norm.

Understanding pgvector: Enabling Vector Operations

The pgvector extension is a powerful tool for storing, modifying, and querying vectors. This functionality enables applications such as similarity and semantic search, retrieval augmented generation, image search, recommendation systems, natural language processing (NLP), and computer vision.

Here are some of the key features and use cases of pgvector:

  • Vector storage: The pgvector extension lets you store high-dimensional vectors directly in PostgreSQL tables. It provides a dedicated data type for vector representation, allowing efficient storage and retrieval of vector data.

  • Similarity search: With pgvector, you can perform similarity searches based on vector similarity metrics such as cosine similarity or Euclidean distance. Searching for similar vectors facilitates applications like content-based recommendation systems, k nearest-neighbor search, and clustering.

  • Semantic search: Large language model (LLM) embeddings are a way to create vectors from other types of data (e.g., text, images, etc). These embeddings represent the meaning of the underlying data. Therefore, using similarity search to find similar vectors returns data with similar semantic meaning or content, effectively identifying data points that share common themes or topics regardless of their original format (text, images, etc).

  • Natural Language Processing (NLP) and text analysis: Vector embeddings are powerful as they capture the underlying meaning of text. Techniques like word or document embeddings enable you to capture the ideas expressed in the text. You can then perform vector-based operations like similarity search, clustering, or classification on textual data.

  • Computer vision: The pgvector extension can handle vector representations of images and enable similarity-based image search. You can convert images to vector representations using convolutional neural networks (CNN) or image embeddings. These capabilities allow you to perform content-based image retrieval, image similarity matching, object identification, and image clustering within the database.

A visual representation of a vector, represented in a graph.

Pgvector stores vectors that represent semantic meaning. Thus things that are meaningfully similar are “closer” together in the vector space. This allows people searching for SUVs to find cars and trucks instead of watermelons and apples, even though SUVs and trucks are different words.

Using pgvector for Vector Data

By utilizing the pgvector extension, developers can efficiently store and query vector data within PostgreSQL, making it easier to build generative AI applications with LLMs, as well as AI applications that require similarity search, recommendation systems, NLP, and other tasks that involve working with vectors.

  • Installation: You can use pgvector on a cloud database service in Timescale or a self-hosted PostgreSQL instance:

  • Vector data type: pgvector introduces a new data type called vector representing a high-dimensional vector. You can define vector-type columns in your database tables to store vector data. For example, you can have a table called documents with an embedding column of type vector to store LLM embeddings as vectors for each document.

    CREATE TABLE documents (
     id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
     content TEXT,
     author_id BIGINT,
     embedding VECTOR(1538)
    );
    
    

  • Indexing and vector search: pgvector provides indexing mechanisms optimized for approximate nearest neighbor search over vector data. You can create an index on a vector column using the CREATE INDEX command, specifying the hnsw index type and a vector_cosine_ops operator class. This enables fast similarity searches on vector data using the cosine distance similarity metric. You can then use the <=> operator to perform similarity searches in your queries:

    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
    
    --find the closest 5 elements to a given query
    SELECT * FROM documents ORDER BY embedding <=> '[10.5, 11.0,...]' LIMIT 5;.
    
    

  • Vector functions: pgvector has a set of built-in functions to manipulate and perform operations on vector data. These functions allow you to calculate vector similarities, perform vector arithmetic, and more. For example, you can use the cosine_distance function to calculate the cosine similarity between two vectors.  Other useful functions are vector_norm() to get the Euclidean norm and vector_dims() to determine how many dimensions a vector contains.

  • Vector aggregates: pgvector has the avg(vector) and sum(vector) aggregate functions for calculating analytics on vectors.

  • Integration with other PostgreSQL features: pgvector seamlessly integrates with other features of PostgreSQL, such as transaction management, query optimization, and security. This integration enables you to leverage the power of PostgreSQL for complex data processing tasks. Joining pgvector data with other data types, like relational, time-series, and geospatial data, is a powerful way to enrich your vector queries’ data. As a simple example, you could return author information for documents found via similarity search:

WITH matching_docs as (
  --find 5 closest matches 
  SELECT * 
  FROM documents 
  ORDER BY embedding <=> '[10.5, 11.0,...]' 
  LIMIT 5
)
SELECT d.content, a.first_name, a.last_name
FROM matching_docs d 
INNER JOIN author a ON (a.id = d.author_id).

Why Use pgvector With Timescale?

Timescale is PostgreSQL++ engineered for time-series data, events, and analytics. It provides a robust foundation for storing, retrieving, and analyzing large volumes of time-series data thanks to its time-based aggregations, data retention policies, and features like hypertables and continuous aggregates.

Many real-world AI applications have retrieval or analytical requirements that include both vector data and a temporal aspect to similarity searches. Some examples of such workloads where time is essential are embeddings of news, legal documents, and financial statements.

In such cases, users often want to find similar documents within a specific time frame. Another common use is time-weighing, where users may not use a strict time filter but still prefer more recent or older data. For such workloads, Timescale can markedly enhance data ingestion and query speeds, thanks to hypertables—automatic time-based partitioning of tables. Timescale is especially good at optimizing queries that filter data based on time, enabling those queries to run more efficiently. 

Furthermore, complex systems use many different types of data workloads such as time series, event, historical, real-time, and structured relational data. Storing these different types of data in one database has many operational advantages. It simplifies operations management by combining tasks like backups, recovery, and security into one system, reducing complexity.

Additionally, combining different data types using joins enables more sophisticated data analysis. By keeping everything in the same database, you can avoid data silos, speed up your workflows and drive faster insights for your users.

How to Use pgvector With Timescale

Step 1. Set up pgvector and timescaledb on a PostgreSQL instance. You can do this either with a cloud-hosted database on Timescale (recommended) or a self-hosted instance.

You then need to execute the following:

CREATE EXTENSION vector;
CREATE EXTENSION timescaledb;

Step 2. Create a Timescale hypertable with a vector column: Create a regular table in your database that will serve as the basis for the hypertable. Convert the table into a hypertable using the create_hypertable function provided by TimescaleDB. Specify the time column and other relevant options. 

CREATE TABLE documents (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  created_at TIMESTAMPTZ,
  content TEXT,
  embedding VECTOR(1538)
);

SELECT create_hypertable('documents', 'created_at');

Step 3. Insert vector data into the hypertable: Use regular SQL INSERT statements to insert data into the hypertable, including the vector data in the designated vector column. Ensure that the vector data is in the correct format expected by pgvector. You will need to get the vector representation from an embedding model (e.g., OpenAI text-embedding-3 models or, if you prefer, open-source sentence transformers). 

INSERT INTO documents (created_at, content,embedding)
VALUES ('2023-06-01 00:00:00', 'the quick', '[1,2,3]'),
       ('2023-06-02 00:00:00', 'brown fox', '[4,5,6]'),
       ('2023-06-03 00:00:00', 'jumped over' '[7,8,9]');

Step 4. Querying and analyzing vector data. Use SQL queries to perform various operations on the vector data stored in the hypertable. You can combine TimescaleDB's time-series functions with pgvector's vector functions for analysis and querying.

Examples of queries you can perform include:

  • Similarity search: Find vectors similar to a given query vector using the <=> operator provided by pgvector. Combine that with filters on other columns, such as time.

  • Aggregation and grouping: Use TimescaleDB's time-series aggregation functions to aggregate vector data over time intervals or other dimensions.

  • Filtering and selection: Use regular SQL filters and conditions to select specific vector data based on certain criteria.

-- Similarity search
SELECT *
FROM documents
ORDER BY data <=> '[1,2,3]'::vector
LIMIT 5;

-- Similarity search filtered by time
SELECT *
FROM documents
WHERE time >= '2023-06-02 00:00:00' AND time < '2023-06-03 00:00:00'
ORDER BY data <=> '[1,2,3]'::vector
LIMIT 5;

-- Aggregation and grouping,
SELECT time_bucket('30 day', created_at) AS day, length(data) AS avg_data_length
FROM sensor_data
GROUP BY day
ORDER BY day;

-- Filtering and selection
SELECT *
FROM sensor_data
WHERE time >= '2023-06-02 00:00:00' AND time < '2023-06-03 00:00:00';

Step 6. Optimization and performance. Depending on the scale and complexity of your data, consider optimizing the performance of your queries by creating indexes on the vector column using the CREATE INDEX command with the hnsw index type and the vector_cosine_ops operator class. Tune the configuration parameters of TimescaleDB and PostgreSQL based on your workload and resource requirements to achieve optimal performance.

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

Next Steps

Want to learn more about pgvector and how to use it? Read the following resources:

Get started with pgvector on a mature, production-ready cloud PostgreSQL platform: sign up for Timescale today (and get 90 days free).