April 15, 2024

Engineering

Why we chose pgvector as our similarity search solution

A closer look at why we adopted pgvector to build a highly performant similarity search solution.


In an effort to build a highly performant similarity search solution, we at PlayerZero dug deep into the options on the market. After days of research, testing, and deliberation, we concluded that pgvector was the best fit for the problem we were looking to solve. Here’s an in-depth look at how we came to this decision and the other options we considered along the way.

The challenge: tell a story with immense and diverse data sets

PlayerZero acts as a single source of knowledge for entire engineering teams. By hooking into SCM, telemetry, ticketing, and analytics data, PlayerZero creates product-specific models that can be leveraged for practical use cases in the support and development workflow. To do this effectively, we needed a way to interpret a user's intent across the distinct data sources above and pull it all together into a clear, cohesive narrative.

Imagine you are a company that runs payroll for biotech startups, and a customer complaint comes in about not being able to update a key piece of employee information in their portal. You would need to:

  1. Pull together similar complaints to identify the common thread

  2. Cross-reference those specific users’ analytics to determine the core flow of what happened

  3. Cross-reference those flows with errors occurring only for those users along those specific touchpoints

  4. Finally, reference those touchpoints against recent code changes to identify the source of the issue, all without lifting a finger.

Now imagine the limitless other connections that can be drawn across this data to tell other useful stories! As you can see, finding a solid solution for this foundational bit of architecture was crucial.

Where we started

From the jump, we used OpenAI to create meaningful embeddings. We also considered solutions like BERT and txtai, but OpenAI's text-embedding-ada-002 was never really in question. The hard questions were around how to store, retrieve, update, and compare the embeddings once we had them.
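
For illustration, here is a minimal sketch of what fetching an embedding looks like against OpenAI's /v1/embeddings endpoint from a JVM stack like ours. The helper name and the raw-string JSON handling are ours, not a prescription; a production client would use a proper JSON library and error handling.

    import java.net.URI
    import java.net.http.HttpClient
    import java.net.http.HttpRequest
    import java.net.http.HttpResponse

    // Sketch only: POST to OpenAI's /v1/embeddings and return the raw JSON.
    // For ada-002, the response's data[0].embedding field holds a 1536-float array.
    fun fetchEmbedding(text: String): String {
        val apiKey = System.getenv("OPENAI_API_KEY") ?: error("OPENAI_API_KEY not set")
        val payload = """{"model":"text-embedding-ada-002","input":"${text.replace("\"", "\\\"")}"}"""
        val request = HttpRequest.newBuilder(URI.create("https://api.openai.com/v1/embeddings"))
            .header("Authorization", "Bearer $apiKey")
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build()
        val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
        return response.body()
    }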

In memory, it is easy enough to write code to do a k-nearest-neighbor (kNN) search, but it is not feasible to keep hundreds of gigabytes of embeddings in memory to do so. We could have written our own solution to handle it, but that would have pulled time and resources away from the actual goal: solving problems by using embeddings.
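
To make that tradeoff concrete, here is roughly what the in-memory approach looks like, as a brute-force kNN sketch in Kotlin. It works fine for a few thousand vectors and falls over long before hundreds of gigabytes:

    import kotlin.math.sqrt

    // Brute-force kNN over an in-memory corpus using cosine similarity.
    // O(n * d) per query: every query scans every stored vector.
    fun cosine(a: FloatArray, b: FloatArray): Double {
        var dot = 0.0; var normA = 0.0; var normB = 0.0
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        return dot / (sqrt(normA) * sqrt(normB))
    }

    // Returns the k entries most similar to the query vector.
    fun knn(query: FloatArray, corpus: List<Pair<String, FloatArray>>, k: Int): List<Pair<String, FloatArray>> =
        corpus.sortedByDescending { (_, vec) -> cosine(query, vec) }.take(k)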

What were our options?

When we evaluated solutions, a few names popped up as viable options: Pinecone, Chroma, Milvus, Weaviate, Vespa, and pgvector. From a marketing and first-impressions perspective, pgvector actually landed at the bottom of the list, as it only did one thing (albeit very well). The others, by contrast, also marketed their ability to address potential future problems.

Chroma, Milvus, and Vespa quickly became the primary focal points, as they looked to have the most ROI tied to them.

Chroma

Chroma was great as a simple, lightweight, dedicated database with LangChain support (spoiler: everything has LangChain support), and it was well liked in the reviews we found. It was also free and open source.

Unfortunately, the community around it was still on the smaller side, and language support was limited to Python and Node at the time of evaluation. Our technology stack is JVM-based with Kotlin as the language of choice, so this created a long-term usability problem: it was not playing to our core strengths.

Milvus

Next up came Milvus, which looked and sounded great. It is a database that aims to do a lot more than just embedding support, which was appealing, but all those extra features came with complexity.

Just getting off the ground to start playing with it in an HA/FA type of mode proved quite difficult, as it supports a wide variety of concepts and features. While these might be useful in theory, it was a lot of work just to get started. Also, at the time we were considering options, the language of choice for interacting with Milvus was Python, and it was difficult to find good support for working with it in a JVM stack. It also felt like a lot of little hurdles needed to be cleared:

  • What were the cloud options?

  • How much did it cost?

  • How would self-managed deployment work?

The community around it was large and active, but adopting Milvus felt like boiling the ocean to solve our very simple and specific problem.

Vespa

At first, Vespa looked to check a lot of the boxes: a simple, embedding-specific database with Java components.

However, with Vespa it was difficult to find documentation on how to do things, from both the development and self-hosting directions. The lack of community support made it feel risky, and risk is something we could not afford when working with customers’ data.

Solution

After looking through all the previous options very closely, we created a simple pro/con list. Once we analyzed what we needed and what the tradeoffs would be, it was clear that we were not happy with any of the options.

pgvector

Surprisingly, the least glamorous solution was the perfect fit. The option that did what we needed best, without any of the other frills, was pgvector.

What is pgvector? pgvector is a simple extension to Postgres that adds a vector column type, a new kind of index, and a way to query vectors. Postgres itself has a rich history, a vibrant community, and enterprise use around the world, so we loved that we could leverage all of that with our solution of choice. It is also extremely simple to use, which let our POC get moving quickly.
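
As a sketch of what that looks like in practice (the table, column names, and toy 3-dimension vectors are ours, purely for illustration), the core surface is just SQL, here driven over plain JDBC from Kotlin:

    import java.sql.DriverManager

    // Minimal sketch of pgvector's SQL surface over plain JDBC.
    // Real ada-002 embeddings would use vector(1536) instead of vector(3).
    fun main() {
        val conn = DriverManager.getConnection("jdbc:postgresql://localhost/app", "user", "pass")
        conn.createStatement().use { st ->
            st.execute("CREATE EXTENSION IF NOT EXISTS vector")
            st.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))")
            // ivfflat trades exact search for speed; `lists` is the main tuning knob.
            st.execute(
                "CREATE INDEX IF NOT EXISTS items_embedding_idx ON items " +
                    "USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)"
            )
        }
        // <=> is pgvector's cosine-distance operator (<-> is L2); lower means closer.
        conn.prepareStatement("SELECT id FROM items ORDER BY embedding <=> ?::vector LIMIT 5").use { ps ->
            ps.setString(1, "[0.1, 0.2, 0.3]")
            ps.executeQuery().use { rs -> while (rs.next()) println(rs.getLong("id")) }
        }
    }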

It is important to note that Postgres is a system our team already had deep knowledge of, so we trusted that it could grow and scale out in the future without worry. Backup and restoration, seeding of data, and interaction from any language are all easy because they are out-of-the-box Postgres features. The only tweak we needed was to add an additional jar file for some DSL improvements that make working with the embeddings cleaner.
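
We won't name the exact jar here, but for a concrete picture of the kind of helper we mean, the open-source pgvector-java library (com.pgvector:pgvector) registers a vector type on the JDBC connection so embeddings round-trip as float arrays rather than strings. A sketch, assuming that library:

    import com.pgvector.PGvector
    import java.sql.DriverManager

    // Illustrative sketch using the pgvector-java helper library.
    // addVectorType registers the vector type on the connection, after which
    // PGvector wraps a FloatArray for inserts and reads.
    fun insertEmbedding(docId: Long, embedding: FloatArray) {
        val conn = DriverManager.getConnection("jdbc:postgresql://localhost/app", "user", "pass")
        PGvector.addVectorType(conn)
        conn.prepareStatement("INSERT INTO items (id, embedding) VALUES (?, ?)").use { ps ->
            ps.setLong(1, docId)
            ps.setObject(2, PGvector(embedding))
            ps.executeUpdate()
        }
    }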

The only downside is one we’ve come to realize more recently: the maximum indexable embedding size is 2,000 dimensions, and OpenAI has recently introduced newer embedding models that produce more than 2,000 dimensions. Fortunately, there is some documentation on customizing database instances to support more dimensions, but those customizations come with a performance tradeoff. For now, the quality difference between larger and smaller dimension counts with OpenAI’s text-embedding-3-small model appears small enough not to be a concern.
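
One mitigation worth noting: the text-embedding-3 models accept a dimensions parameter, so you can ask OpenAI for shorter vectors that fit under the index cap. A sketch of the request payload only; the surrounding HTTP call is the same as in the earlier embedding example:

    // Sketch: request 1536-dim vectors from text-embedding-3-large (3072 by
    // default) so they stay under pgvector's 2,000-dimension index limit.
    val payload = """{"model":"text-embedding-3-large","dimensions":1536,"input":"example complaint text"}"""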

As time progresses and larger dimensions become necessary, we may have to revisit the landscape of options, especially if pgvector itself doesn’t better handle larger dimensions by then…

The Outcome

Due to the simplicity of pgvector's setup, we were able to spin up a box with ease. We used a simple 2 vCPU / 4 GB RAM EC2 instance as the database and played with it to see how well it actually performed. Once we were comfortable with our threshold testing, we felt ready to go to production.

In the past six months, the only operational changes we have made were to create a secondary failover instance (just in case), apply schema changes as the software evolved, and perform the regular upgrades for Postgres security releases. We have made zero operational changes because of pgvector itself since we started using it over six months ago.

For a clearer look under the hood: a single database node on a generic EC2 instance, holding several hundred gigabytes of embeddings, responds to embedding searches within 20 ms. The system rarely exceeds 10% CPU utilization and hovers around 2 GB of total RAM consumption. This is truly an efficient embedding solution, and it has proven both faster and cheaper in total cost of ownership.
