Vector embeddings have made semantic search over unstructured data as easy as field-based querying of structured data in traditional databases - a remarkable development. Featrix brings embeddings to structured data in an automated way, offering "embeddings as a service" for structured, tabular data.
Vector databases have become widely available, but they typically don't generate embeddings themselves. While Featrix was initially integrated with the FAISS library, it is designed to work with any vector database.
This multi-part blog series introduces you to how Featrix makes vectorizing structured enterprise data easy. We’ll begin by bringing everyone up to speed on how vector databases work.
Vector databases are designed to store high-dimensional numeric ("vector") representations for fast retrieval and similarity search. Like traditional databases, they support basic data operations - creating, reading, updating, and deleting, known as "CRUD" operations. Unlike traditional databases, the characteristics of the data aren't captured in fixed fields but in a numeric representation (known as an embedding) generated by a machine learning model trained on large amounts of data (the "embedding model"). The magic arises from that model capturing the meaning and context of the unstructured data.
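To make this concrete, here is a minimal sketch of adding and searching vectors with FAISS, the library Featrix initially integrated with. The 128-dimensional size and the random vectors are illustrative placeholders for the output of a real embedding model.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 128                        # embedding dimensionality (illustrative)
index = faiss.IndexFlatL2(dim)   # exact nearest-neighbor search over L2 distance

# "Create": add a batch of embeddings (random placeholders for real model output)
embeddings = np.random.rand(1000, dim).astype("float32")
index.add(embeddings)

# "Read": retrieve the 5 stored vectors closest to a query embedding
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0], distances[0])      # positions of the nearest neighbors and their distances
```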
Vector databases are commonly used to search for documents that are similar in meaning - search that finds what you mean rather than what you literally typed, also known as "semantic search". Not only can you search text: provided you have appropriate embedding models, you can semantically search other modalities like images and audio. Furthermore, a query in one modality can retrieve related data of a different modality - for example, retrieving images using a textual description - which is known as "multimodal" search.
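As a sketch of semantic text search, the example below uses the open-source sentence-transformers library; the model name and the toy documents are our own illustrative choices, not part of Featrix.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works; this small public model is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The quarterly revenue exceeded expectations.",
    "Our new office opens in Berlin next month.",
    "Customers reported latency issues after the update.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)

# The query shares no keywords with the matching document: the match is semantic.
query_embedding = model.encode("Which document mentions a software problem?",
                               convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
print(docs[scores.argmax().item()])  # -> the latency-issue document
```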
In the context of generative AI, vector databases provide efficient storage and retrieval of the intermediate representations used by the various steps of a multi-stage pipeline. Those representations include vector embeddings of text as described above, but also, for example, summaries of documents that are then classified by their emotional content or other properties.
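The sketch below is a hedged illustration of that pipeline pattern: one step stores an embedding of each document's summary, and a later step retrieves it by similarity. Here summarize() and embed() are hypothetical stand-ins for a real summarization model and embedding model, and a plain list stands in for the vector database.

```python
import numpy as np

def summarize(document: str) -> str:
    # Hypothetical stand-in for a summarization model (e.g., an LLM call)
    return document.split(".")[0]

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for an embedding model: a pseudo-random 64-dim vector
    gen = np.random.default_rng(abs(hash(text)) % (2**32))
    return gen.random(64)

corpus = ["Deployment failed again. Users are upset.",
          "Revenue grew 12% this quarter. The board is pleased."]

# One pipeline step writes intermediate representations to the store...
store = [{"id": i, "summary": summarize(doc), "vector": embed(summarize(doc))}
         for i, doc in enumerate(corpus)]

# ...and a later step (say, emotion classification) retrieves them by similarity.
# With pseudo-random vectors the match is arbitrary; a real model makes it semantic.
query = embed("unhappy customers")
best = max(store, key=lambda row: float(query @ row["vector"]))
print(best["summary"])
```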
In NLP (Natural Language Processing), transformer models also commonly use vector representations of language. Based on those representations, classifying sentiment, extracting named entities, and summarizing or translating pieces of text are performed using fundamentally the same efficient search in high-dimensional spaces that semantic search requires.
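To show how classification reduces to that same nearest-neighbor search, here is a k-NN sentiment classifier sketched over labeled embeddings; the 2-dimensional toy vectors are illustrative (real embeddings have hundreds of dimensions).

```python
import numpy as np

# Toy labeled embeddings: 2-D for readability, hundreds of dims in practice.
labeled = np.array([[0.9, 0.1], [0.8, 0.2],    # "positive" examples
                    [0.1, 0.9], [0.2, 0.8]])   # "negative" examples
labels = ["positive", "positive", "negative", "negative"]

def classify(vector: np.ndarray, k: int = 3) -> str:
    # Majority vote of the k nearest labeled embeddings -- the same
    # nearest-neighbor search a semantic-search query performs.
    distances = np.linalg.norm(labeled - vector, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

print(classify(np.array([0.85, 0.15])))  # -> "positive"
```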
Finally, the vector approach is also used to personalize search results: user characteristics are modeled in a vector representation, and the search results are adjusted using the characteristics of (in a vector sense) "similar" users.
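One hedged sketch of that idea: blend each result's base relevance score with its similarity to a user profile vector. The vectors, scores, and blending weight below are all illustrative choices.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative user profile vector and candidates with base relevance scores.
user_vector = np.array([0.7, 0.3])
candidates = [
    {"title": "Intro to fly fishing", "vector": np.array([0.9, 0.1]), "base": 0.60},
    {"title": "City fishing spots",   "vector": np.array([0.2, 0.8]), "base": 0.65},
]

# Re-rank by blending query relevance with user similarity (alpha is a tunable choice).
alpha = 0.5
for c in candidates:
    c["score"] = (1 - alpha) * c["base"] + alpha * cosine(user_vector, c["vector"])

for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
    print(f'{c["score"]:.3f}  {c["title"]}')
```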
When ChatGPT became popular, most users simply asked questions and were amazed at the quality of the answers - even though, under the hood, this is fundamentally very similar to the semantic retrieval described above. Large language models (LLMs) and vector databases serve different purposes: LLMs are primarily used in a generative function, while vector databases specifically perform similarity searches.
As explained above, vector databases provide a means to store unstructured data and perform similarity search, and the embedding model that generates the vector representations is typically separate from the vector database. Since the release of ChatGPT, a whole array of generative models has become popular for tasks like coding, speech recognition, and image generation. These "foundation models" can handle various types of unstructured data, not just text - just as vector databases are agnostic to what data the embeddings represent.
By contrast, LLMs are trained on vast amounts of text and are vast in size - typically on the order of billions of parameters. LLMs are designed for natural language understanding and generation tasks - much more than just retrieval. The embedding models employed in vector databases are an order of magnitude smaller, at tens to hundreds of millions of parameters.
| | Vector Database | LLM |
| --- | --- | --- |
| Capability | Only search / retrieval | Variety of tasks |
| Accuracy | High (assuming high quality of embeddings) | Can hallucinate if not adapted or combined with domain-specific data |
| Cost | Moderate | High (per-token charge by vendor) |
| Privacy | Your data remains in your control | Have to share your data with 3rd party |
First and foremost, it's difficult to obtain performant embeddings. Off-the-shelf embedding models perform well on general English but deteriorate on specialized domains. Adapting an embedding model to your domain is a complex process that requires significant expertise and resources. Worse, none of the widely available models handle structured data well. That's where Featrix comes into play.
Other challenges with vector databases include:
AI-powered systems can be decomposed into three fundamental layers as shown in the figure below:
The next blog in this series dives into why standard embedding models and LLMs don’t handle structured data well, why combining structured with unstructured data poses significant practical challenges, and how Featrix provides an elegant solution to those challenges.