Featrix Blog

Vector Databases and Embeddings

Written by Bernhard Suhm Ph.D. | Dec 13, 2023 11:24:15 PM

Vector embeddings have made semantic search of unstructured data as easy as field-based querying of structured data in traditional databases - isn't that crazy? Featrix brings embeddings to structured data in an automated way, offering "embeddings as a service" for structured, tabular data.

Vector databases have become widely available, but they typically don't generate embeddings themselves. Featrix was initially integrated with the FAISS library, but it is designed to work with any vector database.

This multi-part blog series introduces you to how Featrix makes vectorizing structured enterprise data easy. We’ll begin by bringing everyone up to speed on how vector databases work.

 

What is a vector database?

Vector databases are designed to store high-dimensional numeric ("vector") representations for fast retrieval and similarity search. Like traditional databases, they support the basic data operations of creating, reading, updating, and deleting, known as "CRUD" operations. Unlike traditional databases, the characteristics of the data aren't captured in fixed fields but in a numeric representation (known as an embedding) generated by a machine learning model trained on large amounts of data (the "embedding model"). The magic arises from the machine learning model capturing the meaning and context of the unstructured data.
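To make this concrete, here is a minimal sketch of storing and searching high-dimensional vectors with FAISS, the similarity-search library mentioned above. The random vectors are stand-ins for what a real embedding model would produce.

```python
# Minimal sketch: store vectors in a FAISS index, then run a similarity search.
import numpy as np
import faiss

dim = 128                                   # dimensionality of the embeddings
rng = np.random.default_rng(42)

# "Create": add 1,000 vectors to a flat index (exact L2 search, no approximation).
vectors = rng.standard_normal((1000, dim)).astype("float32")
index = faiss.IndexFlatL2(dim)
index.add(vectors)

# "Read": retrieve the 5 nearest neighbors of a query vector.
query = rng.standard_normal((1, dim)).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0], distances[0])
```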


What are vector databases used for?

Vector databases are commonly used to search for documents that are similar in meaning - finding what you mean rather than what you typed - which is known as "semantic search". Not only can you search text: provided you have appropriate embedding models, you can semantically search other modalities like images and audio. Furthermore, a query in one modality can retrieve related data of a different modality - for example, finding images that match a textual description - which is known as "multimodal" search.
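As a small illustration, the sketch below runs a tiny semantic search in Python, assuming the open-source sentence-transformers package and its general-purpose all-MiniLM-L6-v2 model; any embedding model would work the same way.

```python
# Semantic-search sketch: embed documents and a query, rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The stock market rallied after the rate announcement.",
    "A new vaccine shows promise in early trials.",
    "The team won the championship in overtime.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

# The query shares no keywords with the finance headline, yet should still
# find it, because similarity is computed in embedding space, not on words.
query_vec = model.encode(["central bank monetary policy decision"],
                         normalize_embeddings=True)

scores = doc_vecs @ query_vec.T             # cosine similarity (vectors normalized)
print(docs[int(np.argmax(scores))])
```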

In the context of generative AI, vector databases provide efficient storage and retrieval of the intermediate representations used by the steps of a multi-step pipeline. Those representations include vector embeddings of text as described above, but also, for example, summaries of documents that are then classified by their emotional content or other properties.

In the context of NLP (Natural Language Processing), transformer models also commonly use vector representations of language. Building on those representations, tasks such as classifying sentiment, extracting named entities, and summarizing or translating text rely on fundamentally the same efficient search in high-dimensional spaces that semantic search requires.
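As a hedged sketch of that point, the toy example below reduces sentiment classification to a nearest-neighbor search over labeled embeddings. It again assumes sentence-transformers; the labeled examples are made up for illustration.

```python
# Classification as similarity search: label a new text by its nearest
# labeled example in embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

examples = ["I loved every minute of it.", "This was a complete waste of time."]
labels = ["positive", "negative"]
example_vecs = model.encode(examples, normalize_embeddings=True)

new_text = "Absolutely delightful, would recommend."
new_vec = model.encode([new_text], normalize_embeddings=True)

# The "classifier" is just a similarity search over labeled embeddings.
nearest = int(np.argmax(example_vecs @ new_vec.T))
print(labels[nearest])
```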

Finally, the vector approach is also used to personalize search results: user characteristics are modeled as a vector, and results are adjusted using the characteristics of (in a vector sense) "similar" users.
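One simple (toy) way to implement this is to blend each result's query relevance with its similarity to a user-profile vector. In the sketch below, the vectors and the 0.7/0.3 blend weights are purely illustrative, not a recommended configuration.

```python
# Toy personalization: re-rank results by mixing query relevance with
# the user's affinity for each result, all in one embedding space.
import numpy as np

rng = np.random.default_rng(0)
dim = 64

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

doc_vecs = normalize(rng.standard_normal((5, dim)))
query_vec = normalize(rng.standard_normal(dim))
user_vec = normalize(rng.standard_normal(dim))    # e.g. an average of liked items

relevance = doc_vecs @ query_vec                  # how well each doc matches the query
affinity = doc_vecs @ user_vec                    # how well each doc matches the user
personalized = 0.7 * relevance + 0.3 * affinity   # arbitrary illustrative weights

print("generic ranking:     ", np.argsort(-relevance))
print("personalized ranking:", np.argsort(-personalized))
```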

 

How are they different from LLMs?

When ChatGPT became popular, most users simply asked questions and were amazed at the quality of the answers - though under the hood this is fundamentally very similar to the semantic retrieval we described above. Large language models (LLMs) and vector databases serve different purposes: LLMs are primarily used in a generative role, while vector databases specifically perform similarity search.

As explained above, vector databases provide a means to store unstructured data and perform similarity search; the embedding model that generates the vector representations is typically separate from the vector database. Since the release of ChatGPT, a whole array of generative models has become popular for tasks like coding, speech recognition, and image generation. These "foundation models" can handle various types of unstructured data, not just text - just as vector databases are agnostic to what data their embeddings represent.

By contrast, LLMs are trained on vast amounts of text and are vast in size - typically on the order of billions of parameters. LLMs are designed for natural language understanding and generation tasks - much more than just retrieval. The embedding models employed with vector databases are an order of magnitude or two smaller, at tens to hundreds of millions of parameters.



 

|            | Vector Database                          | LLM                                                                     |
|------------|------------------------------------------|-------------------------------------------------------------------------|
| Capability | Search / retrieval only                  | Variety of tasks                                                        |
| Accuracy   | High (assuming high-quality embeddings)  | Can hallucinate if not adapted to or combined with domain-specific data |
| Cost       | Moderate                                 | High (per-token charges by vendor)                                      |
| Privacy    | Your data remains in your control        | Typically requires sharing your data with a third party                 |

 

What are the challenges in using vector databases?

First and foremost, it's difficult to obtain performant embeddings. Off-the-shelf embedding models perform well on general English but deteriorate on specialized domains. Adapting an embedding model to your domain is a complex process that requires significant expertise and resources. Worse, none of the widely available models handle structured data well. That's where Featrix comes into play.

 

Other challenges with vector databases include:

  • Data updates: adding new data introduces significant latency, since the data structures that support efficient search in high dimensions depend on the totality of the searchable data, and popular approaches like HNSW graphs gradually deteriorate as new data is incorporated until they are rebuilt from scratch.
  • Lack of interpretability: embeddings are opaque numeric vectors, so it is hard to explain why two items are considered similar.
  • High dimensionality of the representation: storing and efficiently searching high-dimensional data is computationally expensive, and the "curse of dimensionality" works against accuracy - similarity becomes less distinct the higher the dimensionality (see the small experiment after this list).
  • Last but not least, there is the difficulty of handling structured data alongside the embedded unstructured data.
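The curse-of-dimensionality point is easy to demonstrate. In the small numpy experiment below, the cosine similarities between a query and random unit vectors bunch ever more tightly together as the dimension grows, so the nearest neighbor stands out less and less.

```python
# As dimension grows, similarities between random unit vectors concentrate
# around zero: the spread (std) shrinks roughly like 1/sqrt(dim).
import numpy as np

rng = np.random.default_rng(1)
for dim in (2, 16, 128, 1024):
    points = rng.standard_normal((10_000, dim))
    points /= np.linalg.norm(points, axis=1, keepdims=True)
    query = rng.standard_normal(dim)
    query /= np.linalg.norm(query)
    sims = points @ query
    print(f"dim={dim:5d}  max={sims.max():+.3f}  "
          f"mean={sims.mean():+.3f}  std={sims.std():.3f}")
```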

 

AI-powered systems can be decomposed into three fundamental layers as shown in the figure below:

  1. At the top sit the actual AI models, which work on representations of the data and - depending on the use case - deliver analytics, make predictions using classic supervised or reinforcement learning, or generate content in the case of generative AI.
  2. In the middle sits the representation of the data. In classic machine learning, these were typically features derived from raw structured data, while signal and image processing provided features for some types of unstructured data. With the advent of transformer-based NLP and vector search, we can represent unstructured data of various types, including text, using embeddings. But standard embeddings can't handle structured data, and that's where Featrix comes in.
  3. At the bottom sits the storage layer, where the actual data and its representations reside: raw data and derived features in traditional relational or NoSQL databases, and embeddings in vector databases. These databases physically store the data either on-premise or in cloud storage.

The next blog in this series dives into why standard embedding models and LLMs don’t handle structured data well, why combining structured with unstructured data poses significant practical challenges, and how Featrix provides an elegant solution to those challenges.