Written by Haziqa Sajid
Have you ever wondered how artificial intelligence (AI) systems can understand our language? The key concept is embeddings, where words and phrases are converted into high-dimensional vectors that capture their meanings and relationships. These vectors allow computers to perform mathematical operations on language data. The challenge then becomes storing these high-dimensional vectors efficiently.
The good news is that you can use an old friend, PostgreSQL, and transform it into a full-fledged vector database with the pgvector extension. Adding Python to the mix allows you to build Python applications with machine learning elements.
For those who aren’t sure about how to achieve this, here’s what you’ll learn in this guide:
What are vector databases and pgvector
How to install and use pgvector in Python
How to leverage PostgreSQL for your AI applications
Let’s start by explaining the concepts required to understand pgvector. Then, we will provide the technical details to get up and running with pgvector in Python.
A vector database stores data as high-dimensional vectors: mathematical representations of features or attributes. These vectors, with dimensions ranging from tens to thousands, are derived from raw data such as text, images, and even video through embedding algorithms. Vector databases are designed for fast, precise similarity search, enabling retrieval based on semantic meaning rather than exact keyword matches.
Editor’s Note: Check this article to learn the difference between vector search and semantic search.
PostgreSQL does not have native vector capabilities (as of PostgreSQL 16), and pgvector is designed to fill this gap. Pgvector is an open-source vector similarity search extension for PostgreSQL. Here's how it enhances PostgreSQL, making it a proficient vector database:
1. Vector and traditional storage combined: With pgvector, you can store vector data alongside traditional relational data, retaining compatibility with existing database features while adding new capabilities.
2. Advanced search: Pgvector stores data as vectors, enabling exact and approximate nearest-neighbor search (the latter is used when the search space is extensive) with several distance metrics, such as L2 distance, inner product, and cosine distance. This makes it efficient to find the most similar content for a given query. The supported distance operators are shown below:
Supported operators in pgvector
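For reference, and subject to the pgvector version you install, the core distance operators in pgvector 0.7.0 are:

| Operator | Distance |
| --- | --- |
| <-> | L2 (Euclidean) distance |
| <#> | negative inner product |
| <=> | cosine distance |
| <+> | L1 (taxicab) distance |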
3. Integration with PostgreSQL features: Pgvector integrates seamlessly with standard PostgreSQL features, including the following:
ACID compliance: ensuring transactional integrity
Point-in-time recovery: enabling database restoration to specific moments
JOIN support: facilitating data combination from multiple tables
Enough theory; let’s get into practicality.
In the previous section, we built an understanding of pgvector. This section will set the foundation for seeing it in action using Python.
If you haven't already, start by installing a PostgreSQL database. Then, make sure psql is available on your OS. Psql is PostgreSQL's terminal-based interactive client: it lets you execute SQL statements and manage database objects directly against the PostgreSQL server.
If you have installed PostgreSQL correctly, executing the following command in the terminal will provide version information:
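> psql --version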
PostgreSQL installation verification
Note: You must manually add the path to the environment variables on Windows.
Now, we are ready to connect to the database.
To connect to the database using psql, we will run the following command in the terminal:
> psql -U postgres
It will prompt for the password, and upon successful authentication, you will be connected to the database.
Connecting to the database using psql
Let’s add the pgvector extension and see the relevant requirements. Later, we will walk you through how to use pgvector with Python.
Installing pgvector
On Windows, ensure that C++ support in Visual Studio is installed. If not, install the Visual Studio Build Tools and select the Desktop development with C++ workload.
C++ support in Visual Studio
After this, run the following command in the Command prompt. Make sure it is open in administrator mode.
> call "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat
Note: The specific path will vary based on the version and edition of Visual Studio.
In the next steps, we will set PGROOT, clone the pgvector repository, and then build with nmake:
> set "PGROOT=C:\Program Files\PostgreSQL\16"
> cd %TEMP%
> git clone --branch v0.7.0 https://github.com/pgvector/pgvector.git
> cd pgvector
> nmake /F Makefile.win
> nmake /F Makefile.win install
Note: nmake is Microsoft's implementation of the make tool. It comes with the C++ build tools; if the command is not recognized in the command prompt, make sure its path is added to the environment variables.
After a successful build, run the following query in psql:
postgres=# CREATE EXTENSION vector;
To ensure pgvector is installed, run this query:
postgres=# \dx
Importing pgvector to Python
To use the pgvector extension in your Python projects, follow these steps:
1. Install the pgvector package in your Python environment. This can be done using pip:
%pip install pgvector
2. Once installed, import the pgvector package in your Python script:
import pgvector
3. Additional packages are required to connect to a PostgreSQL database. A commonly used package is psycopg2.
import psycopg2
4. Depending on your specific requirements, you may need to install and import other packages. For example:
For data manipulation and analysis, you might use pandas:
%pip install pandas
import pandas as pd
For object-relational mapping (ORM), you might use SQLAlchemy:
%pip install SQLAlchemy
from sqlalchemy import create_engine
These steps will give you the tools and packages to work with pgvector and PostgreSQL in your Python environment.
This section will explore using the pgvector extension in PostgreSQL with Python, from connecting to the database to querying it. We have completed the installations and imports, so let's get started.
First, we must connect to our PostgreSQL database using psycopg2. Here's how you can do it:
import psycopg2
conn = psycopg2.connect(
    host="localhost",
    database="your_database_name",
    user="your_username",
    password="your_password"
)
In this snippet, we import the psycopg2 library and use it to connect to the PostgreSQL server. Replace your_database_name, your_username, and your_password with your actual database credentials.
Next, create a cursor to execute SQL commands.
cur = conn.cursor()
The cursor executes SQL commands and fetches their results. We can also verify the connection with a quick "hello world" example.
cur.execute("SELECT 'hello world'")
print(cur.fetchone())
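One detail worth noting: CREATE EXTENSION applies per database, so if the database you connected to here is not the one where you enabled the extension in psql, enable it again before creating vector columns. The pgvector Python package installed earlier can then register the vector type on this connection; this is a minimal sketch of its psycopg2 integration, which lets you pass NumPy arrays as query parameters without manual casts:
from pgvector.psycopg2 import register_vector
# Enable the extension in this database (a no-op if it is already enabled)
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
# Register the vector type with psycopg2 for this connection
register_vector(conn)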
Basic example: Start by creating a simple table with an embedding vector. In the following snippet, we define and execute an SQL command to create a table named vectors with an embedding column of dimension three (3).
# Define a SQL command to create a table
create_table_command = """
CREATE TABLE vectors (
    id bigserial primary key,
    embedding vector(3) -- specify the dimension of the vector
);
"""
# Execute the SQL command
cur.execute(create_table_command)
# Commit the transaction
conn.commit()
We can confirm the table's creation by running \dt in psql:
Psql service to ensure the table creation
Example with more features: Let's create a table with more features for a language-model workflow. This command creates a table named embeddings with additional columns such as label, url, content, and tokens, plus an embedding vector of dimension three (3).
# Define a SQL command to create a more complex table
create_table_command = """
CREATE TABLE embeddings (
    id bigserial primary key,
    label text,
    url text,
    content text,
    tokens integer,
    embedding vector(3) -- specify the dimension of the vector
);
"""
# Execute the SQL command
cur.execute(create_table_command)
# Commit the transaction
conn.commit()
Let's insert some data into the newly created table. In the real world, embeddings are generated by a model trained on a vast corpus of data; such models capture the semantics of the words.
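For instance, here is a sketch of what that might look like with the sentence-transformers package (an illustrative assumption; it isn't used elsewhere in this guide). Note that the table's vector column must match the model's output dimension, e.g., vector(384) for this model rather than vector(3):
from sentence_transformers import SentenceTransformer
# all-MiniLM-L6-v2 produces 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("The article on dogs").tolist()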
For this example, we resort to random arrays instead. We use the NumPy library to generate random vectors and insert them into the embeddings table. Each entry includes a label, URL, content, token count, and a vector.
import numpy as np
# Define the SQL command for inserting data
insert_command = """
INSERT INTO embeddings (label, url, content, tokens, embedding)
VALUES (%s, %s, %s, %s, %s);
"""
# Create sample data
data = [
    ("label1", "http://example.com/1", "The article on dogs", 100, np.random.rand(3).tolist()),
    ("label2", "http://example.com/2", "The article on cats", 150, np.random.rand(3).tolist()),
    ("label3", "http://example.com/3", "The article on cars", 200, np.random.rand(3).tolist()),
    ("label4", "http://example.com/4", "The article on books", 250, np.random.rand(3).tolist()),
    ("label5", "http://example.com/5", "The article on embeddings", 300, np.random.rand(3).tolist())
]
# Insert data into the table
for record in data:
    cur.execute(insert_command, record)
# Commit the transaction
conn.commit()
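As a side note, psycopg2 can batch these inserts in a single call, as an alternative to the loop above:
# Batched insert equivalent to the loop above
cur.executemany(insert_command, data)
conn.commit()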
Let's check the data stored in the embeddings table. We will use fetchall() to retrieve the rows from the table.
fetch_contents = """
select * from embeddings
"""
# Execute the SQL command
cur.execute(fetch_contents)
# Commit the transaction
cur.fetchall()
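Since we installed pandas earlier, you can also pull the table into a DataFrame for inspection. A quick sketch (recent pandas versions warn that they prefer a SQLAlchemy connectable over a raw psycopg2 connection here):
import pandas as pd
# Read the whole table into a DataFrame
df = pd.read_sql("SELECT * FROM embeddings", conn)
print(df.head())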
Retrieve the top five similar objects from the database using nearest-neighbor search. In this example, we generate a random query vector and find the five most similar entries in the embeddings table based on vector similarity. Note the explicit type cast ::vector, which ensures PostgreSQL interprets the parameter as a vector.
# Generate a random query vector
query_vector = np.random.rand(3).tolist() # Example query vector
print(query_vector)
# Define the SQL command to retrieve similar objects
retrieve_command = """
SELECT content
FROM embeddings
ORDER BY embedding <=> %s::vector
LIMIT 5;
"""
# Execute the command with the query vector
cur.execute(retrieve_command, (query_vector,))
# Fetch the results
similar_objects = cur.fetchall()
# Print the similar objects
for obj in similar_objects:
    print(obj[0])
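On a table this small, PostgreSQL simply scans every row. For larger datasets, pgvector supports approximate indexes; here is a minimal sketch using an HNSW index (available in pgvector 0.5.0 and later), where vector_cosine_ops matches the <=> operator used above:
# Build an approximate nearest-neighbor index for cosine distance
cur.execute("CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops)")
conn.commit()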
Finally, close the cursor and connection. Closing the cursor and connection ensures that all resources are properly released.
cur.close()
conn.close()
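Alternatively, psycopg2 connections and cursors work as context managers. A small sketch; note that leaving the connection block commits (or rolls back) the transaction but does not close the connection:
with psycopg2.connect(
    host="localhost",
    database="your_database_name",
    user="your_username",
    password="your_password"
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM embeddings")
        print(cur.fetchone())
conn.close()  # the with block does not close the connection itself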
So far, we've explored pgvector for vector storage and similarity search in PostgreSQL. But you can experience a supercharged version of pgvector: with Timescale Cloud, developers can use pgvector alongside pgvectorscale and pgai, two open-source extensions that turn PostgreSQL into an easy-to-use and high-performance vector database, plus a fully managed cloud database experience.
Timescale Cloud enhances PostgreSQL for AI applications by packaging pgvector as part of its offering. It retains all the capabilities of pgvector, such as the vector data type and indexing methods like HNSW and IVFFlat, making it a complement rather than a replacement for pgvector.
Timescale Cloud introduces new features on top of pgvector, such as time-based vector search. This makes it easy to migrate your existing pgvector deployment and take advantage of additional features for scale. You also have the flexibility to create different index types suited to your needs.
With pgvectorscale, Timescale Cloud accelerates approximate nearest-neighbor (ANN) searches on large-scale vector datasets by incorporating a cutting-edge ANN index inspired by the DiskANN algorithm (StreamingDiskANN). Also included is pgai, which brings more AI workflows to PostgreSQL, making it easier for developers to build search and retrieval-augmented generation (RAG) applications. This complete stack will make the development of AI applications faster, more efficient, and scalable.
Check out this blog post to learn more about pgai and how it brings embedding and generation models closer to the database. We also explained how pgvectorscale makes PostgreSQL faster and cheaper than other specialized vector databases, like Pinecone.
Let’s set up Timescale Cloud with Python and see it in action.
1. Installations: Ensure you have installed the required Python packages, timescale_vector and python-dotenv, using the following command:
%pip install timescale_vector python-dotenv
2. Import required libraries: Import the necessary libraries in your Python script as shown below:
from dotenv import load_dotenv, find_dotenv
import os
from timescale_vector import client
import uuid
from datetime import datetime, timedelta
3. Setting up Timescale: To start, sign up, create a new database, and follow the provided instructions. For more information, refer to the Getting Started with Timescale guide.
After signing up, connect to the Timescale database by providing the service URI, which can be found under the service section on the dashboard. The URI will look something like this:
postgres://tsdbadmin:@.tsdb.cloud.timescale.com:/tsdb?sslmode=require
Configuration dashboard for connecting to the service
The password can be created by going to project settings and clicking Create credentials.
Project settings page for creating credentials
4. Set up the .env file: Create a .env file in your project directory and include your PostgreSQL credentials in the following format:
TIMESCALE_SERVICE_URL=your_service_url_here
5. Load PostgreSQL credentials: Load your PostgreSQL credentials from the .env file:
_ = load_dotenv(find_dotenv(), override=True)
service_url = os.environ['TIMESCALE_SERVICE_URL']
A vector store client can be created with client.Sync, which takes the following arguments:

| Name | Description |
| --- | --- |
| service_url | Timescale service URL / connection string |
| table_name | Name of the table to use for storing the embeddings |
| num_dimensions | Number of dimensions in the vector |
Initialize the vector client with your service URL, table name, and the number of dimensions in the vector:
vec = client.Sync(service_url, "embeddings", 3)
Now, we will create tables for the collection and insert data. The data is represented as follows:
A UUID that serves as a unique identifier for the embedding.
Metadata about the embedding, stored in JSON format.
The text the embedding represents.
The embedding itself.
vec.create_tables()
vec.upsert([
(uuid.uuid1(), {"service": "pgvectors"}, "Pgvector is an open-source vector similarity search extension of PostgreSQL.", [1.0, 1.3, 2.0]),
(uuid.uuid1(), {"service": "timescale_vector"}, "Timescale Vector amplifies Pgvector's capabilities by integrating it into PostgreSQL for AI applications, adding new functionalities like the Timescale Vector index and time-based vector search, while also enhancing scalability with advanced ANN indexing inspired by the DiskANN algorithm.", [1.0, 10.8, 3.0]),
])
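This also explains the datetime imports above: according to the timescale_vector README, IDs can be derived from timestamps with client.uuid_from_time, so rows can later be filtered by time range. A sketch, treating the exact call as an assumption to verify against the library's docs:
# Give a row a time-based UUID so time-range filtering works later
specific_time = datetime(2024, 1, 1, 12, 0, 0)
vec.upsert([
    (client.uuid_from_time(specific_time), {"service": "pgai"},
     "Pgai brings AI workflows, such as embedding creation, into PostgreSQL.",
     [0.5, 2.0, 1.0]),
])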
Let's confirm that the data was inserted into the table. We can see it in the Explorer tab of the Timescale console:
Ensuring data insertion in the Timescale console
Timescale Cloud provides a very simple API for searching similar vectors. As mentioned before, real query vectors are generated by specialized models from input data; in this example, we query with an arbitrary array.
vec.search([2.0, 9.4, 3.0])
Here’s the result:
>> [[UUID('5773dc0f-20d6-11ef-8777-dae2664cc367'),
{'service': 'timescale_vector'},
"Timescale Vector amplifies Pgvector's capabilities by integrating it into PostgreSQL for AI applications, adding new functionalities like the Timescale Vector index and time-based vector search, while also enhancing scalability with advanced ANN indexing inspired by the DiskANN algorithm.",
array([ 1. , 10.8, 3. ], dtype=float32),
0.006858884829482381],
[UUID('5773dc0e-20d6-11ef-a334-dae2664cc367'),
{'service': 'pgvectors'},
'Pgvector is an open-source vector similarity search extension of PostgreSQL.',
array([1. , 1.3, 2. ], dtype=float32),
0.2235118982678087]]
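The search call accepts options as well; per the timescale_vector README, you can cap the number of results and filter on metadata. A sketch, with the keyword names as assumptions to verify:
# Return at most one match whose metadata has service == "timescale_vector"
vec.search([2.0, 9.4, 3.0], limit=1, filter={"service": "timescale_vector"})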
In this article, we introduced the concept of vector databases and took a closer look at pgvector. We covered installing the extension and used Python to work with it. Finally, we used Timescale Cloud, which includes pgvector, pgvectorscale, and pgai, to leverage its faster search capabilities.
Utilizing pgvector with Python empowers the development of robust machine-learning applications. With pgai and pgvectorscale, Timescale Cloud elevates PostgreSQL vector databases to new heights, offering enhanced capabilities and performance at scale.
Pgai and pgvectorscale are both open source under the PostgreSQL License and available for you to use in your AI projects today. You can find installation instructions on the pgai and pgvectorscale GitHub repositories (Git ⭐s welcome!). You can also access them on any database service on Timescale’s cloud PostgreSQL platform.