Dec 18, 2024

Evaluating Open-Source vs. OpenAI Embeddings for RAG: A How-To Guide

Posted by Jacky Liang

When building a search or RAG (retrieval-augmented generation) application, you face a common challenge: which embedding model should you use? The most common choice is between proprietary models, like those from OpenAI, and open-source models, of which there are dozens to choose from. Once you've narrowed down candidates for testing, the actual testing process can also be very time-consuming. You need to work through the following checklist:

  • Set up each model's infrastructure.
  • Create consistent test data.
  • Build evaluation pipelines.
  • Run reliable benchmarks.
  • Compare results fairly.

Who’s got time for all of this?

Meme: Ain't nobody got time for that

Oh, and there’s another problem! 

If you already have a working search application, testing new models often means disrupting your existing setup or building a separate testing environment. This frequently pushes testing and experimentation down the priority list, leaving potential accuracy gains and cost savings on the table.

We wanted to find a simpler way to evaluate different embedding models and understand their real-world performance. In this guide, we'll show you how to use pgai Vectorizer, an open-source tool for embedding creation and sync, to test different embedding models on your own data. We'll walk through our evaluation of four popular models (both open-source and OpenAI) using Paul Graham's essays as test data and share a reusable process you can adapt for your own model evaluations.

Here's what we learned from our tests—and how you can run similar comparisons yourself.

💡 If you're using proprietary large language models and other AI tools and are looking for an alternative open-source AI stack, check out this article.

Embedding Models Compared: Open Source vs. OpenAI

In this embedding model evaluation, we will compare the following embedding models:

  1. OpenAI text-embedding-3-large (1,536 dimensions)
  2. OpenAI text-embedding-3-small (768 dimensions)
  3. BGE large (1,024 dimensions)
  4. nomic-embed-text (768 dimensions)

We chose these models because they represent both popular closed-source and open-source options for embeddings. OpenAI's models are widely used and considered industry standard, while BGE large and nomic-embed-text are leading open-source alternatives that can run locally. Note that the OpenAI dimensions listed above are the sizes we configured for this test (both text-embedding-3 models support shortened embeddings via the dimensions parameter); BGE large and nomic-embed-text run at their native dimensions.

Do Models Understand?

Given this text chunk from Paul Graham's essay "Cities and Ambition":

"Cambridge as a result feels like a town whose main industry is ideas, while New York's is finance and Silicon Valley's is startups."

We tested questions like the following: 

  • "What industry is Cambridge known for?" (Short question)
  • "How does the economic focus of New York on finance shape opportunities compared to Cambridge?" (Detailed question)
  • "What does the comparison suggest about different economic focuses?" (Context-based question) 

Essentially, we are testing how different embedding models handle different ways of asking for the same information. By taking text chunks and generating various types of questions about them—from direct to paraphrased questions—we can see if the model understands not just exact matches but also meaning and context, which is typically how humans ask questions. 

Setting Up the Test Environment

Instead of building the test infrastructure from scratch, we will use pgai Vectorizer to simplify the test massively. You can follow the pgai Vectorizer quick start guide to set up pgai with Ollama in just a few minutes. 

Using pgai Vectorizer saves you significant development time by handling embedding operations directly in PostgreSQL. Here’s a list of what it can do:

  • Creates and updates embeddings automatically when source data changes
  • Supports models from OpenAI and a variety of open-source models via Ollama
  • Handles text chunking with configurable settings
  • Manages the embedding queue and retries
  • Creates a view combining source data with embeddings

Since pgai Vectorizer is built on PostgreSQL, you can use familiar SQL commands and integrate them with your existing database. Who needs bespoke vector databases? 

A meme: Just use PostgreSQL | Change my mind

First, let's load Paul Graham's essays into the database. We have a convenient function that loads datasets from Hugging Face directly into your PostgreSQL database.

SELECT ai.load_dataset('sgoel9/paul_graham_essays');

Testing different embedding models is super simple with pgai Vectorizer.

You simply create multiple vectorizers, each using a different model you want to evaluate. The vectorizer handles all the complexity of creating and managing embeddings for each model. Here's how we set up two of the models (nomic-embed-text and text-embedding-3-small, both small embedding models) to compare:

-- Set up Nomic embed-text
SELECT ai.create_vectorizer( 
    'pg_essays'::regclass,
    destination => 'essays_nomic_embeddings',
    embedding => ai.embedding_ollama('nomic-embed-text', 768),
    chunking => ai.chunking_recursive_character_text_splitter('text', 512, 50)
);

-- Set up OpenAI's small embedding model
SELECT ai.create_vectorizer(
    'pg_essays'::regclass,
    destination => 'essays_openai_small_embeddings',
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_recursive_character_text_splitter('text', 512, 50)
);

You can query the generated embeddings directly in the embedding view for each model.

SELECT id, title, date, chunk, embedding 
FROM essays_nomic_embeddings 
LIMIT 5;
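
To see retrieval in action, you can also run a similarity search against these views. Here's a minimal sketch in Python, assuming a placeholder connection string and the pgvector extension that backs the embedding column: it embeds a question with the same OpenAI model and dimensions the essays_openai_small_embeddings vectorizer uses, then orders chunks by cosine distance.

import psycopg
from openai import OpenAI

DB_URL = "postgresql://postgres:postgres@localhost:5432/postgres"  # placeholder, adjust for your setup
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_essays(question: str, top_k: int = 10):
    # Embed the query with the same model and dimensions the vectorizer uses
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
        dimensions=768,
    )
    query_embedding = response.data[0].embedding

    # pgvector's <=> operator computes cosine distance; smaller means more similar
    sql = """
        SELECT title, chunk, embedding <=> %s::vector AS distance
        FROM essays_openai_small_embeddings
        ORDER BY distance
        LIMIT %s;
    """
    with psycopg.connect(DB_URL) as conn:
        return conn.execute(sql, (str(query_embedding), top_k)).fetchall()

for title, chunk, distance in search_essays("What industry is Cambridge known for?"):
    print(f"{distance:.4f}  {title}: {chunk[:80]}...")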

Pgai Vectorizer comes with a number of useful helpers, such as ai.vectorizer_status for monitoring all vectorizers. Querying it is a good way to get an overview of every vectorizer you've created and the progress of its embedding tasks.

SELECT * FROM ai.vectorizer_status;

If you want to try it out, here's the full API reference for pgai Vectorizer.

The Evaluation Logic

This evaluation will focus on how well each model can find relevant text when given different types of questions. The methodology is as follows:

  1. Randomly select 20 text chunks from the dataset.
  2. Generate 20 questions per chunk.
  3. These questions are evenly distributed across five distinct types: 
    1. Short questions (under 10 words) for testing basic comprehension.
    2. Long questions for detailed analysis.
    3. Direct questions about explicit content.
    4. Implied questions that require contextual understanding.
    5. Unclear questions to test handling of ambiguous queries.
  4. The evaluation retrieves the top 10 most similar chunks for each question across the embedding models.
  5. Vector search testing:
    1. For each model in the test list:
      1. For each stored question:
        1. Do vector search on the model's embedding table.
        2. Check if source_chunk_id appears in TOP_K results.
        3. Score binary: 1 if found, 0 if not found.
  6. Calculate the score:
    1. Count the total number of successful retrievals.
    2. Divide by the total number of questions (NUM_CHUNKS * NUM_QUESTIONS_PER_CHUNK).
    3. Report the resulting hit rate as each model's accuracy (sketched below).
A Bernie Sanders meme: I am once again asking for the most relevant context based on the input prompt
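
To make the scoring concrete: the final number is just a hit rate over all generated questions. A minimal sketch, reusing the configuration names from the evaluation code:

NUM_CHUNKS = 20
NUM_QUESTIONS_PER_CHUNK = 20

def overall_accuracy(scores: list[int]) -> float:
    # scores holds one binary value per question: 1 if the source chunk was retrieved, 0 if not
    total_questions = NUM_CHUNKS * NUM_QUESTIONS_PER_CHUNK  # 20 chunks x 20 questions = 400
    return sum(scores) / total_questions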


The advantage of this method of evaluation is its simplicity and the fact that you don't have to curate ground truth manually. In practice, we've seen this method work well, but it does have its limitations (as all methods do): if the content in the dataset is too semantically similar, the questions generated by the LLM may not be specific enough to retrieve the chunk they were generated from, and the evaluation doesn't account for where the correct chunk ranks within the top k. You should always spot-check any eval method on your particular dataset. /rant

The Evaluation Code

The full evaluation code is available on GitHub if you want to run your own tests on different embedding models or change evaluation parameters. If you're a video person, we've got you covered, too.

Here are some key highlights from the evaluation code.

Create questions for testing model understanding:

def generate_questions(self, chunk: str, question_type: str, count: int) -> List[str]:
    prompts = {
        'short': "Generate {count} short, simple questions about this text. Questions should be direct and under 10 words:",
        'long': "Generate {count} detailed, comprehensive questions about this text. Include specific details:",
        'direct': "Generate {count} questions that directly ask about explicit information in this text:",
        'implied': "Generate {count} questions that require understanding context and implications of the text:",
        'unclear': "Generate {count} vague, ambiguous questions about the general topic of this text:"
    }
    
    prompt = prompts[question_type].format(count=count) + f"\n\nText: {chunk}"

    questions = []
    
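    # Retry loop for the LLM call; max_retries is defined in the full script (not shown in this excerpt)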
    for attempt in range(max_retries):
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Generate different types of questions about the given text. Each question must be on a new line. Do not include empty lines or blank questions."},
                {"role": "user", "content": prompt}
            ],
        )

Evaluate how well each model finds relevant content:

def step3_evaluate_models(self):
    """Test how well each model finds the relevant chunks"""
    for table in Config.EMBEDDING_TABLES:
        scores = []
        for q in self.questions_data:
            search_results = self.db.vector_search(
                table, 
                q['question'], 
                Config.TOP_K
            )
            # Check if the model found the correct chunk
            found = any(
                r[0] == q['source_chunk_id'] and 
                r[1] == q['source_chunk_seq'] 
                for r in search_results
            )
            scores.append(1 if found else 0)
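
The per-question-type breakdown shown in the results below can be computed from the same binary scores. Here's a hypothetical sketch of that step (it assumes each entry in questions_data carries a question_type field; that naming is ours, not necessarily the repo's):

from collections import defaultdict

def accuracy_by_question_type(questions: list[dict], scores: list[int]) -> dict[str, float]:
    """Group binary scores by question type and compute per-type accuracy."""
    by_type: dict[str, list[int]] = defaultdict(list)
    for question, score in zip(questions, scores):
        by_type[question['question_type']].append(score)
    return {qtype: sum(s) / len(s) for qtype, s in by_type.items()}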

The Results

A bar graph displaying each model's overall accuracy

OpenAI's large model performed best overall with 80.5 % accuracy, while their smaller model achieved 75.8 %. The open-source models were competitive—BGE large reached 71.5 % accuracy, and nomic-embed-text hit 71 %.

For our test data of 215 essays, our vectorizer created 8,426 chunks using a 512-character chunk size and 50-character overlap. The embedding costs were reasonable: $0.03 for text-embedding-3-small and $0.15 for text-embedding-3-large. Open-source models had no API costs, only the compute resources and time needed to run them locally.

Four bar graphs evaluating each model's accuracy by question type for short, long, direct, implied, and unclear questions

All models handled detailed questions surprisingly (or unsurprisingly?) well. Even the open-source options achieved around 90 % accuracy, with OpenAI's large model reaching a stunning 97.5 %. Unsurprisingly, it appears that when users provide more context in their queries, the models do better at finding relevant information. 

However, the biggest differences showed up in the questions requiring contextual understanding. OpenAI's large model reached 88.8 % accuracy here, while the other models stayed around 75-78 %. We suspect the extra dimensions in the larger model help it capture more subtle relationships in the text. Open-source models still have some ground to make up on low-context queries before they catch up to their closed-source counterparts.

Vague questions were challenging for every model, with accuracy between 42 % and 57 %. This isn't surprising—even humans struggle with ambiguous queries. 

How To Choose an Embedding Model

After reviewing the results, we noticed some critical points to consider around embedding model choice, chunking strategy, and input data quality:

Context matters a lot:

  • Detailed questions achieved 88-97 % accuracy across all models.
  • Context-based questions reached 75-89 % accuracy.
  • Vague questions achieved 42-57 % accuracy.

Size vs. performance trade-off:

  • OpenAI large (1536d) achieved 80.5 % overall accuracy.
  • OpenAI small (768d) achieved 75.8 % overall accuracy.
  • BGE large (1024d) achieved 71.5 % overall accuracy.
  • Nomic embed-text (768d) achieved 71 % overall accuracy.

Cost considerations:

  • Open-source models perform within 5-10 % of OpenAI's small model.
  • Larger models show clear advantages for context understanding (88.8 % vs. 75-78 %).
  • The performance gap may or may not justify the cost difference for your use case.

Before choosing a model, consider the following:

  1. Data type and volume: Input text quality and chunking strategies matter. Basically, garbage in will result in garbage out. Use pgai Vectorizer’s formatting to further improve context. 
  2. Typical user query patterns: How users search affects model choice—direct questions need less sophistication than natural language queries.
  3. Performance requirements: Cloud-hosted models (like OpenAI's) and locally run models (like those served via Ollama) come with different performance, latency, and setup trade-offs.

  4. Cost constraints: OpenAI costs about $0.020/1M tokens for text-embedding-3-small and $0.130/1M tokens for text-embedding-3-large, while open-source models have no usage fees but require compute resources and time to run locally.

Conclusion

Choosing the right embedding model can make or break your AI application, given its implications for accuracy, cost, and efficiency. In this blog post, we used pgai Vectorizer to evaluate four embedding models and shared a reusable process you can adapt to test other models you may be interested in.

If you want to try this process on your own dataset, install pgai Vectorizer and take it out for a spin. While you’re at it, start saving time by using it to handle embedding operations without leaving PostgreSQL—no specialized databases required.

Further reading

Here are more blog posts about RAG with PostgreSQL and different tools:

  1. Simple Embedding Model Evaluation GitHub Code
  2. Pgai Vectorizer quick start for Ollama on self-hosted PostgreSQL
  3. Introduction to pgai Vectorizer
