Subscribe to the Timescale Newsletter

By submitting you acknowledge Timescale's  Privacy Policy.

A Complete Guide to Creating and Storing Embeddings for PostgreSQL Data

A Complete Guide to Creating and Storing Embeddings for PostgreSQL Data

Why use embeddings for your PostgreSQL data

Vector embeddings provide a mathematical representation of data, encapsulating its semantic essence in a form that machines can readily process. While commonly associated with text, images, and audio, virtually any binary data can be converted into this format. 

Generating embeddings from data already stored in your PostgreSQL database unlocks a multitude of applications:

Embeddings enable semantic search, which transcends the limitations of traditional keyword-driven methods. It doesn't just seek exact word matches; it grasps the deeper intent behind a user's query. The result? Even if search terms differ in phrasing, relevant results are surfaced. Taking advantage of hybrid search, which marries lexical and semantic search methodologies, offers users a search experience that's both rich and accurate. It's not just about finding direct matches anymore; it's about tapping into contextually and conceptually similar content to meet user needs.

Recommendation systems benefit immensely from embeddings. Imagine a user who has shown interest in several articles on a singular topic. With embeddings, the recommendation engine can delve deep into the semantic essence of those articles, surfacing other database items that resonate with the same theme. Recommendations, thus, move beyond just the superficial layers like tags or categories and dive into the very heart of the content.

Generative AI, particularly retrieval-augmented generation (RAG), can be powered using the data stored in a PostgreSQL database. This turns your data into more than just tabular information; it becomes context for Large Language Models (LLMs) like OpenAI’s GPT-4 Turbo, Anthropic’s Claude 2, and open-source modes like Llama 2. When a user poses a query, relevant database content is fetched and used to supplement the query as additional context for the LLM. This helps reduce LLM hallucinations, as it ensures the model's output is more grounded in specific and relevant information, even if it wasn't part of the original training data.

Furthermore, embeddings offer a robust solution for clustering data in PostgreSQL. Transforming data into these vectorized forms enables nuanced comparisons between data points in a high-dimensional space. Through algorithms like K-means or hierarchical clustering, data can be categorized into semantic clusters, offering insights that surface-level attributes might miss. This deepens our grasp of inherent data patterns, enriching both exploration and decision-making processes.

This guide delves into the process of creating and managing embeddings for data residing in PostgreSQL using PgVectorizer, a library we developed to make managing embeddings simple. PgVectorizer both creates embedding from your data and keeps your relational and embedding data in sync as your data changes.

We'll navigate through architectural considerations, set up the library, perform a sync between your relational and embedding data, and query your embeddings. To learn more about how we built this, in this article, we go under the hood and explore how PgVectorizer works: we’ll cover schema layout, how we designed the system for performance, concurrency, and resilience, and explore a few alternative design decisions.

Let’s get started! 

Creating Embeddings for Data in PostgreSQL (and Keeping Them Up-To-Date With Your Tables)

As a running example, we’ll use a simple blog application storing data in PostgreSQL using a table defined as:

CREATE TABLE blog (
  id              SERIAL PRIMARY KEY NOT NULL,
  title           TEXT NOT NULL, 
  author          TEXT NOT NULL,
  contents        TEXT NOT NULL,
  category        TEXT NOT NULL,
  published_time  TIMESTAMPTZ NULL --NULL if not yet published
);

We want to create embeddings on the contents of the blog post so we can later use it for semantic search. Embeddings should only exist and be searchable for blogs that have been published (where the published_time is NOT NULL). 

To make working with embeddings simple and resilient, any system that creates embeddings should have the following goals:

  • No modifications to the original table. This allows systems and applications that already use this table not to be impacted by changes to the embedding system. This is especially important for legacy systems.
  • No modification to the applications that interact with the table. Having to modify the code that alters the table may not be possible for legacy systems. It’s also poor software design because it couples systems that don’t use embeddings with code that generates the embedding.
  • Automatically update embeddings when rows in the source table change (in this case, the blog table). This lessens the maintenance burden and contributes to worry-free software. At the same time, this update need not be instantaneous or within the same commit. For most systems, “eventual consistency” is just fine.
  • Ensure resilience against network and service failures: Most systems generate embeddings via a call to an external system, such as the OpenAI API. In scenarios where the external system is down or a network malfunction occurs, it's imperative that the remainder of your database system continues working.

These guidelines act as a robust framework for the following architecture: 

Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table. We use the example use case of a blogging application, hence the names above.
Reference architecture for a simple and resilient system for embedding data in an existing PostgreSQL table. We use the example use case of a blogging application, hence the names above.

In this design, we first add a trigger to the blog table that monitors for changes and, upon seeing a modification, inserts a job into the blog_work_queue table that indicates that a row in the blog table is out-of-date with its embedding.

On a fixed schedule, an embeddings creator job will poll the blog_work_queue table, and if it finds work to do, will do the following in a loop:

  1. Read and lock a row in the blog_work_queue table 
  2. Read the corresponding row in the blog table
  3. Create an embedding for the data in the blog row
  4. Write the embedding to the blog_embedding table
  5. Delete the locked row in the  blog_work_queue table

Next, we’ll discuss how to implement this simply by using the Python Vector library, a library for working with vector data using PostgreSQL

Easily manage embedding PostgreSQL data using LangChain and the Python Vector library

We’ve added functionality to our library to make embedding PostgreSQL data as simple as possible. We call this functionality PgVectorizer.

Define your embedding creation function

There are myriad ways to embed your data. We don’t want to force you to use just one pre-defined method, so we ask users to define how to embed their data. Thus, we ask you to provide us with a Python function callback to create and store embeddings from database data. We call this the embed_and_write function, and it’s best to illustrate how to write it with an example. 

Using the blog example above, it could look like the following when embedding using LangChain, a popular framework to work with LLM applications:

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from timescale_vector import client, pgvectorizer
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.timescalevector import TimescaleVector
from datetime import timedelta


import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(), override=True)

TIMESCALE_SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]

def get_document(blog):
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    docs = []
    for chunk in text_splitter.split_text(blog['contents']):
        content = f"Author {blog['author']}, title: {blog['title']}, contents:{chunk}"
        metadata = {
            "id": str(client.uuid_from_time(blog['published_time'])),
            "blog_id": blog['id'], 
            "author": blog['author'], 
            "category": blog['category'],
            "published_time": blog['published_time'].isoformat(),
        }
        docs.append(Document(page_content=content, metadata=metadata))
    return docs

def embed_and_write(blog_instances, vectorizer):
    # Note: the vectorizer argument isn’t used in this example but it
    # provides a way to get the name of the table being embedded,
    # along with other metadata.


    embedding = OpenAIEmbeddings()
    vector_store = TimescaleVector(
        collection_name="blog_embedding",
        service_url=TIMESCALE_SERVICE_URL,
        embedding=embedding,
        time_partition_interval=timedelta(days=30),
    )

    # delete old embeddings for all ids in the work queue. locked_id is a special column that is set to the primary key of the table being
    # embedded. For items that are deleted, it is the only key that is set.
    metadata_for_delete = [{"blog_id": blog['locked_id']} for blog in blog_instances]
    vector_store.delete_by_metadata(metadata_for_delete)

    documents = []
    for blog in blog_instances:
        # skip blogs that are not published yet, or are deleted (in which case it will be NULL)
        if blog['published_time'] != None:
            documents.extend(get_document(blog))

    if len(documents) == 0:
        return
    
    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    ids = [d.metadata["id"] for d in documents]
    vector_store.add_texts(texts, metadatas, ids)

The embed_and_write() function gets a list of blogs that have either been created, updated, or deleted. Its job is to update the vector store with the new blogs. We do this in two steps:

  1. Delete all existing vectors already in the vector store for items with the same primary key. The primary key is passed in via a special “locked_id” attribute. This is necessary if rows are deleted or updated.
  2. Create embeddings for all items that were updated or deleted. Deleted items will have all attributes other than “locked_id” set to None so that any attribute can be used as a sentinel. In the example above, we use “published_time” because we also want to skip embedding documents where “published_time” is NULL in the database.

The get_document() function is very use-case-specific, and you’ll have to adjust this code to suit your needs. Because of the context length limitations in LLM completion queries and token length limitations in embedding generation models, you will likely need some way to split long text up into smaller chunks. Here, we use a simple CharacterTextSplitter in LangChain, but much more complex approaches are possible. In the code above, we use a simple but effective trick: add some semantic context to each chunk by prepending the author and category. The only real requirement for the metadata generation portion is including the blog_id, which we can later use to delete old embeddings for a given blog.

In the code snippet above, we use time-based partitioning based on published_time. This type of partitioning drastically speeds up hybrid search on time and embedding similarity. We partition by time by setting the UUID based on the timestamp using client.uuid_from_time() function and by specifying time_partition_interval=timedelta(days=30) when creating the pgai on Timescale vector store. This type of partitioning really speeds up search when filtering by both published_time and vector similarity. See this explainer video for more on how time-based partitioning works in pgai on Timescale (previously known as Timescale Vector).

Once this is written, all you have to do is call the following code on a schedule:

vectorizer = pgvectorizer.Vectorize(service_url, 'blog')
while vectorizer.process(embed_and_write) > 0:
    pass

This is the embedding creator job, which will sync your PostgreSQL data with a vector store. You can run this Python script on a schedule from practically anywhere:

It will automatically track which rows within the blog table have changed and call the embed_and_write function on batches of changed rows. It is performant, resilient to failures, and can be run in parallel when you have a backlog of things that need to be embedded. Designing such a system to perform well is harder than it sounds. But we’ve done it for you in PgVectorizer as part of the Python Vector library

Searching through your embeddings

Use of the embeddings depends on how the embeddings were generated as well as the use case. We will illustrate some simple search applications to work with the LangChain example we gave above and also provide references for more advanced applications, such as hybrid search on metadata and time.

TABLE_NAME = "blog_embedding"
embedding = OpenAIEmbeddings()
vector_store = TimescaleVector(
        collection_name=TABLE_NAME,
        service_url=TIMESCALE_SERVICE_URL,
        embedding=embedding,
        time_partition_interval=timedelta(days=30)
)
# find closest item
res = vector_store.similarity_search_with_score("Tell me about Travel to Istanbul", 1);

#hybrid search with time
start_dt = datetime(2021, 1, 1, 0, 0, 0) 
end_dt = datetime(2024, 1, 1, 0, 0, 0)
res = vector_store.similarity_search_with_score("Tell me about Travel to Istanbul", 1, start_date=start_dt, end_date=end_dt);

There are, of course, a lot more options for search, including filters and predicates on metadata, self-query retriever options, integrations with chat and RAG, and more! We recommend reading this LangChain tutorial for more info about the above-mentioned methods! 

Note: While the above example uses LangChain, you can also swap in frameworks like LlamaIndex, a popular LLM data framework that integrates well with pgai on Timescale, or do DIY document parsing and embedding in Python using OpenAI’s text-embedding-ada-002 model, or an open-source embedding model like sentence-transformers, while using the Python Vector client for vector search. 

Conclusion and Next Steps

In this blog post, we have outlined a system adept at generating vector embeddings from data stored in PostgreSQL and automatically keeping them up to date. This architecture ensures the embeddings remain synchronized with the perpetually evolving data, responding seamlessly to insertions, modifications, and deletions. 

Using PostgreSQL to handle both data storage and background embedding generation offers an interesting new paradigm for maintaining embeddings as data changes. Many AI demonstrations and tutorials tend to concentrate only on initial data creation from documents, often missing the complexities of keeping data and embeddings synchronized as it evolves.

If your goal is to embed and keep up-to-date data in a PostgreSQL table, then you are done! Just use the PgVectorizer class in the Python Vector library and the code above to start embedding your PostgreSQL data and leveraging semantic and hybrid search in your applications. 

If you are curious about how the PgVectorizer library works “under the hood” and how we designed the system for high performance, see our companion blog post about how we designed a resilient embeddings system for PostgreSQL data, which discusses the system design decisions and trade-offs we made while building PgVectorizer above. 

If you’d like to go straight to applying what you learned to your own data in PostgreSQL, here are some resources to continue your learning journey:

Ingest and query in milliseconds, even at terabyte scale.
This post was written by
9 min read
PostgreSQL
Contributors

Related posts