Vector embeddings have made semantic search of unstructured data as easy as field-based...
Vector Embeddings: The Magic Behind Today’s AI
Vector Embeddings and the Power of Featrix
In this video, Pawel, our CTO at Featrix, delivers a master class in vector embeddings. He explains what vector embeddings are and how they have become the lingua franca of AI: these days, embeddings power everything from image generation to recommendation engines. He also demonstrates the capabilities of Featrix, which enables AI teams to build personalized embeddings from their own tabular data. With Featrix, AI becomes accessible to all developers: teams building AI systems with Featrix no longer have to clean data, fill in missing values, standardize formats, or combine data sets for different tasks.
At Featrix, we want to enable developers of any skill level to easily add predictive AI features to their apps with the Featrix API.
Video Transcript
Hello, my name is Pawel, and I'd like to tell you a little bit about vector embeddings, and why I believe they are the magic behind today's AI. And hopefully by the end of this presentation, you'll agree with me that embeddings are an emerging superpower for developers.
First, a little bit about myself: I am the CTO at Featrix, where we help developers build custom embeddings for tabular data. I have about a decade of experience in machine learning and probability. I did research in physics and thermodynamics, and then I worked on language models and tabular data at a DARPA-funded startup. I am also a software engineer and developer: I built a full-stack web app and ran the associated solo, bootstrapped business.
What are Vectors?
The most basic representation of a vector is just a list of numbers. Another very common representation is an arrow.
This works in two dimensions or in 3D, but remember that here we're talking about vectors that might have dozens to thousands of dimensions. So these vectors are very, very large. And so even though in this presentation I'll be using arrows to represent vectors, just remember that we're talking about really, really big vectors.
One of my favorite interpretations of vectors is that they represent layers in a neural net, and therefore the state of an ongoing computation. You can think of the layers of a network as separate stages in the computation that the network executes, so a vector can be thought of as the intermediate result at a particular stage of that computation, corresponding to a specific layer of the neural net.
Vectors are great for computation because they make computation a lot more efficient. If you can take your computation and represent it as a sequence of operations on vectors, it can be executed very effectively, because vector operations are easy to parallelize and can run on hardware accelerators such as GPUs. Even though it makes a lot of sense to try to represent your computation in terms of vector operations, I would argue that it makes even more sense to represent your data as vectors.
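To make the efficiency point concrete, here is a minimal NumPy sketch (the vector size is arbitrary) comparing an explicit Python loop with the equivalent vectorized operation:

```python
import numpy as np

# Two "really big" vectors, e.g. 4,096 dimensions each.
a = np.random.rand(4096)
b = np.random.rand(4096)

# Element-wise sum written as an explicit Python loop.
slow = np.array([a[i] + b[i] for i in range(len(a))])

# The same computation as a single vector operation; NumPy dispatches it to
# optimized native code, and the identical expression runs on GPUs with
# libraries such as CuPy or PyTorch.
fast = a + b

assert np.allclose(slow, fast)
```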
And this is because vectors come with an operation called the dot product. This is an operation that most of us learn in high school math, but it really helps us bridge the gap between how computers represent information and how people think.
Computers represent information in zeros and ones. Something is either exactly the same or completely different from another object. But humans think associatively. And this is what vectors allow us to represent.
Vectors have an entire spectrum of similarity, from being completely aligned to pointing in completely opposite directions. And it's this ability to represent a degree of association that makes vectors such a potent language for expressing computations in machine learning.
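As a small illustration of that spectrum, here is a sketch (with made-up vectors) using cosine similarity, which is just the dot product of length-normalized vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of length-normalized vectors: +1 aligned, 0 unrelated, -1 opposite."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, u))                             #  1.0 -- completely aligned
print(cosine_similarity(u, -u))                            # -1.0 -- opposite directions
print(cosine_similarity(u, np.array([3.0, 0.0, -1.0])))    #  0.0 -- unrelated directions
```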
Finally, we can use vectors as a universal encoding of sorts. Every bit of data comes with its own encoding format: the way that the data is captured and stored on disk. But when we're working with machine learning, we need that data to be represented as vectors.
In particular, embeddings can be trained such that all these different types of data are represented in a way that can be consumed in a homogeneous fashion by downstream models. That makes computation very convenient and also makes the downstream systems easier to deal with. So we think it's a good idea to encode different types of data as vectors.
What is an Embedding?
But how is this done? What is the process of taking data in its original form and converting it into vectors? Well, this is exactly what we call an embedding. It's a function that takes any set of objects and represents them as vectors. And the objects that go into an embedding can be just about anything. Most people are familiar with word or text embeddings, where a piece of text is represented as a vector. But this idea is applicable to just about any set.
So you can embed images. You can also embed user profiles or objects in e-commerce catalogs. You can embed movies and songs, as well as traditional data types like integers, floats, and so on.
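For common data types this is already a one-liner with off-the-shelf models. A minimal sketch for text, using the sentence-transformers library (the model name here is just an example, not the only choice):

```python
from sentence_transformers import SentenceTransformer

# Any pretrained text-embedding model works here; this one is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

vectors = model.encode(["a small gray cat", "a large brown dog"])
print(vectors.shape)  # (2, embedding_dimension), e.g. (2, 384) for this model
```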
When you train such a function, you get an object that we call an embedding space. It's not just a plain vector space, which is a purely mathematical object.
In our terminology, an embedding space is the collection of vectors that comes from applying this function to a particular set of objects. In particular, the function has the property that the dot product between the vectors that make up the embedding space has a meaning in terms of the similarity of the underlying objects. We'll talk about this a little more shortly.
Alright, so what is all of this good for? Why do we want to transform data into this vector form? The reason is that the way data is represented determines which computations are easy and which are difficult.
So here, consider an image represented in RGB format as a sequence of triplets of numbers, where each number represents the intensity of a given color channel. Here we're looking at a 3x3 grid of pixels taken from this image of a cute kitty. When data is represented in this format, some operations are very easy to do, such as making a color histogram. This is easy because we don't need to interpret the data: the data are just numbers, and we're treating them as numbers in this operation.
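A color histogram really is just counting numbers. Here is a minimal NumPy sketch on a random stand-in for the image:

```python
import numpy as np

# A stand-in for the kitty photo: a 3x3 grid of pixels, 3 color channels, values 0-255.
image = np.random.randint(0, 256, size=(3, 3, 3), dtype=np.uint8)

# Per-channel histogram: no interpretation of the image needed, just counting values.
for channel, name in enumerate(["red", "green", "blue"]):
    counts, _ = np.histogram(image[:, :, channel], bins=8, range=(0, 256))
    print(name, counts)
```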
Something that's a little bit more complicated is, say, adjusting the contrast of the image. This is because now we have to treat these numbers as parts of an image and we need to consider different linear combinations of them.
Something that's far more difficult is operating not just on the image itself, but on the contents of that image, such as when we want to move the cat a little bit to the left. Such an operation requires operating on nonlinear combinations of the different pixels that compose the image. Therefore it's really hard to do computationally when the image is represented in this way. And so the way in which the underlying data is represented strongly affects the kinds of things we can do with it easily.
In machine learning, typically the first step of any pipeline is to take the data and represent it in the form of a vector. If you go to scikit-learn, you'll find that it contains a lot of methods for doing that: for taking data and representing it as a list of numbers. However, that list of numbers can get very long. For example, one-hot encoding can create vectors with tens of thousands of features. What's more, the dot product between the resulting vectors isn't very meaningful.
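For instance, one-hot encoding a single categorical column with scikit-learn produces exactly this kind of long, sparse vector once the column has many distinct values (the toy column below is made up):

```python
from sklearn.preprocessing import OneHotEncoder

# A made-up categorical column; with thousands of distinct values this would
# produce vectors with thousands of mostly-zero features.
cities = [["Austin"], ["Boston"], ["Chicago"], ["Austin"]]

encoder = OneHotEncoder()
vectors = encoder.fit_transform(cities)   # sparse matrix, one column per distinct value
print(vectors.toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```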
Ideally, we'd like a representation of the data that's vectorized but also very information-dense. We take the information from those large, somewhat naive vectors and compress it into a lower number of dimensions, such that the dot product between different elements of the embedding is meaningful. And so embedding is exactly that process of taking these large, naive representations of the underlying data and making them a lot more information-dense and suitable for processing.
So how is that done? Well, there are different ways of training an embedding, but one of the most common recent ones works by taking in pairs of objects that are related in some way and enforcing that they are embedded to locations in the embedding space that are close to each other.
Let's say that objects A and B are similar, and in this case, we mean that they co-occur frequently in the data set. So for example, if they were words, that might mean that they are neighbors in a piece of text. If they were images, A and B could be different views of the same image, or they could be images that represent the same idea, and so on.
We train the embedding such that they end up close together. We can represent that in 2D as a pair of aligned arrows, or in 3D or a higher number of dimensions as a pair of nearby points in some higher-dimensional space.
And when two things are different from one another, which in this case means they appear infrequently together, they are embedded to locations that are far apart from one another.
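A minimal sketch of this kind of pairwise training, with a toy encoder and random stand-in data (the talk doesn't prescribe a particular loss; an InfoNCE-style contrastive loss is one common choice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoder: 32 raw features in, 16-dimensional embedding out.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

def contrastive_loss(x_a, x_b, temperature=0.1):
    """Pull co-occurring pairs (x_a[i], x_b[i]) together, push mismatched pairs apart."""
    a = F.normalize(encoder(x_a), dim=-1)
    b = F.normalize(encoder(x_b), dim=-1)
    similarities = a @ b.T / temperature   # all pairwise cosine similarities
    targets = torch.arange(len(x_a))       # row i should be most similar to column i
    return F.cross_entropy(similarities, targets)

# Stand-in batch of related pairs (e.g. two views of the same object).
x_a, x_b = torch.randn(8, 32), torch.randn(8, 32)
loss = contrastive_loss(x_a, x_b)
loss.backward()
```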
So similar things go together, different things go apart. And to make this a little bit more precise, let's say we have an embedding that embeds images and pieces of text, and we trained it so that images and text that represent the same general idea are embedded to similar regions in the vector space.
Embeddings convert “similarity” to proximity in embedding space
So here we have a picture of a cat and its description, and they're both embedded nearby in this region of the embedding space. But an image of a dog and its description are embedded somewhere else but still close to one another. So a good embedding space is one where similar things go together and different things are separated.
And so why does this matter? Well now that we have this information-dense vector representation, some computations are much much easier to do.
And so an operation like moving the cat to the left is far simpler and more straightforward if we take as input this already compressed representation of the image than if we take the image represented as the raw bytes in which it was captured and stored. Vector representations trained in this way, that is, vector embeddings, allow us to bust through this computational barrier.
And as a result, there's a lot more of life that's subject to computation right now. We can do things right now that we couldn't do before and vector embeddings are at the heart of this capability. And so I think we're going to be witnessing far, far more computation happening in the near future than we had in the past. And we're going to have to manage and organize all of that computation somehow.
And I think that we are seeing the emergence of a three-tiered structure for the computational systems that are going to be necessary to fully utilize the power of AI. We're seeing a lot of development happening right now in what I would call the reasoning, interaction, and generation layer.
This is the layer that most users interact with. This is where they provide their prompts, where they read the outputs, and where they see the generated images. It's also where they get the predictions or other outputs from analytical systems, and where systems that utilize reinforcement learning would reside.
At the bottom of the stack, there is a storage layer, which consists of the very common and very popular types of databases, as well as data lakes and data warehouses, but also vector databases, which have come onto the scene relatively recently.
And we believe at Featrix that what's missing from this picture is a middle representation layer that essentially translates all of the information in the storage layer into a form that other AI systems can find easy to digest and operate on. Right now, this layer isn't getting a lot of attention, because some of the foundational models for text, images, and audio play that role: they have internal representations that they can operate on.
However, as AI makes its way into the enterprise, and as the scope of the computation we'd like to do with AI, and in the enterprise overall, increases, custom vector embedding representations are going to be needed for many, many new data sets that do not necessarily conform to the expectations of already-published models. And so we believe that this representation layer will become a crucial element of machine learning systems in the future, because all of the data will have to flow through this layer.
Mental Models for Embeddings
Now I'd like to go through a list of ways of thinking about embeddings that I find particularly helpful. I've been working with embeddings for a couple of years, and time and time again I find myself going back to specific ways of thinking about them, specific mental models. This is a list of my own personal favorites. It's not meant to be exhaustive, and these different ways of thinking about embeddings are not mutually exclusive; they have quite a bit of overlap. Nevertheless, I find that having these mental models in my toolkit helps me reason about embeddings more easily and understand how they can solve different problems.
An interface for composing neural networks
The first mental model is that embeddings serve as an interface for composing neural networks. Data in any modality can be passed through an encoder (typically a modality-specific one, though more and more frequently we find encoders that are multimodal). Once the information is encoded and compressed into a vector, it can be passed into multiple downstream networks, each trained to solve a particular task, such as a prediction or a classification. This architecture means that the data only has to be encoded into a vector once, as opposed to being re-encoded by multiple networks that each start from scratch, which makes computation a lot more efficient.
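A minimal sketch of that composition pattern (the layer sizes and task heads are arbitrary): encode once, then feed the same vector to several small task-specific networks.

```python
import torch
import torch.nn as nn

embedding_dim = 16

# Shared encoder: raw features in, one embedding vector out.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, embedding_dim))

# Independent downstream heads that all consume the same embedding.
classifier = nn.Linear(embedding_dim, 3)   # e.g. a 3-class classification task
regressor = nn.Linear(embedding_dim, 1)    # e.g. predicting a single number

x = torch.randn(8, 32)          # a batch of raw records
z = encoder(x)                  # encoded once...
class_logits = classifier(z)    # ...then reused by every downstream model
value_estimate = regressor(z)
```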
Dimensionality reduction
A second mental model is that embeddings are great for dimensionality reduction. Whenever we see a new data set, we typically ask questions like, what's in here? What can I do with this data? And what other data can I expect to see in the future? Those are the kind of questions that dimensionality reduction is very helpful for.
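A sketch of that kind of exploration, using plain PCA on made-up vectors as a stand-in for a trained embedding, just to show the shape of the workflow:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the embedding vectors of 1,000 records, 64 dimensions each.
embeddings = np.random.rand(1000, 64)

# Project down to 2 dimensions to eyeball clusters, outliers, and gaps.
coords = PCA(n_components=2).fit_transform(embeddings)
print(coords.shape)  # (1000, 2) -- ready for a scatter plot
```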
And my favorite example of this is a project by researchers who were looking at the use of embeddings for designing better drugs and pharmaceuticals. They trained an embedding space using a method slightly different from the one I introduced earlier, but one that the researchers were able to navigate: they could move in a direction that would improve a molecule's physical and chemical properties. Because all the information was condensed into this space, they could navigate it much more efficiently and essentially arrive at a point corresponding to a molecule with a different structure. The paper I'm referencing here is from about five years ago, but since then, multiple companies have been started with the express purpose of using this technique to design better drugs.
A trainable abstraction - multimodal embeddings
The third mental model, and my personal favorite, is that an embedding is a form of abstraction, and a great example of this is multimodal embeddings. Those are embeddings that can process different types of data such that objects representing similar ideas end up in a similar place in the embedding space. So here, consider a picture of a cat and a description of that picture both being embedded in similar locations. Now we can take the resulting embedding vectors and pass them as inputs to a downstream model. Let's say this model is a simple neural network meant to predict whether a given piece of data has something to do with cats or not. Because we create the embedding first and then pass it to the downstream model, the downstream model doesn't have to care what the modality of the original data was. It just needs to decide whether something has to do with cats or not. That makes for an overall system that is less complex, which is the primary role of abstractions in computer science in general.

Multimodal embeddings, and using embeddings as an abstraction, also show up in the systems we train for image generation.
So here, typically what happens is that the system ingests examples of the images we would eventually like to reproduce. We take these images and represent them as embeddings using an appropriate model. Then we feed those embeddings into a diffusion model, which is trained to reconstruct the original image that was fed into the network. I'm describing this at a very high level; it's very simplified, and there's a lot more detail to it. When we query the model, for example when we want to create an image from a textual description, we take the text and pass it through a text embedding model that was trained jointly with the image embedding model, and we get a text embedding. From that text embedding, we can use the same diffusion model to generate an image of a dog. And so here, the diffusion model doesn't really have to care whether the original description of that image came from an image itself or from a text query. It just does the same thing regardless.
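A toy sketch of why the abstraction helps (the encoders and the downstream model below are placeholders, not a real diffusion model): the downstream model only ever sees a vector, so it cannot tell, and does not need to know, which modality produced it.

```python
import torch
import torch.nn as nn

embedding_dim = 16

# Placeholder encoders that map each modality into the same shared space.
image_encoder = nn.Linear(3 * 32 * 32, embedding_dim)   # a flattened 32x32 RGB image
text_encoder = nn.Linear(100, embedding_dim)            # e.g. a bag-of-words text vector

# One downstream model that consumes embeddings from either encoder.
cat_classifier = nn.Sequential(nn.Linear(embedding_dim, 8), nn.ReLU(), nn.Linear(8, 1))

image_embedding = image_encoder(torch.randn(1, 3 * 32 * 32))
text_embedding = text_encoder(torch.randn(1, 100))

# Same model, same call, regardless of where the embedding came from.
print(cat_classifier(image_embedding), cat_classifier(text_embedding))
```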
A trainable search index for a database
And finally, we have an interpretation of embeddings as a trainable search index for a database. This has been happening a lot recently, and it is one of the reasons why vector databases have become so popular. Let's say that you have items in some database that you're interested in querying, and you represent those items as vectors (drawn here in 2D). Then you have a query, which you also represent as a vector, and what you find is that this query aligns with some vectors better than others. In other words, the element represented by the query vector is more similar to some items in the database than to others. As a result, we can return the elements in the database in order of similarity to the query vector.

In practice, what this means is that if the items in the database are document snippets and the query is a user prompt, the results represent the most relevant documents in the database. This is the technique used in what's called retrieval augmented generation, which is very commonly done these days in the context of LLMs. If the items in the database represent movies, songs, or images, and the query vector represents a user profile, the results will contain content that the user is most likely to engage with; most recommendation engines online today work this way. Finally, if the items in the database are objects in an e-commerce catalog and the query is a user's shopping cart, the results will contain elements of the catalog that the user is most likely to buy in the future, which is very helpful in sales and marketing.
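A minimal sketch of that retrieval pattern with NumPy (the "database" here is just a random matrix standing in for item embeddings):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in vector database: 10,000 items, each a 64-dimensional embedding.
items = normalize(np.random.rand(10000, 64))
query = normalize(np.random.rand(64))

# Rank every item by cosine similarity to the query and keep the top 5.
scores = items @ query
top_5 = np.argsort(scores)[::-1][:5]
print(top_5, scores[top_5])
```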
Embeddings are perfect for tabular data
At Featrix, we believe that the barrier to computation I mentioned before is not a feature of unstructured data only. Here I have an example table, and different types of computation have different levels of difficulty when data is represented this way. Something that's very easy to do is to sum up order volume by customer. You can do this with a simple SQL query, and these days you can have an LLM write the SQL query for you. Something that's a little more difficult, because it involves a degree of interpretation of the data, is computing revenue for, say, the third quarter of 2023. And something that's far more difficult is, say, identifying the customers who are about to churn. Recall the image example: operating not just on how the data is represented but on what it means is something that is very difficult for computers to do, and this is the computational barrier that embeddings help us jump over.
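For reference, the "easy" end of that spectrum takes only a few lines, whether in SQL or, as sketched here, in pandas on a made-up orders table:

```python
import pandas as pd

# A made-up orders table; the column names are just for illustration.
orders = pd.DataFrame({
    "customer": ["Acme", "Acme", "Globex", "Initech"],
    "order_volume": [120, 80, 250, 40],
})

# Summing order volume by customer is pure mechanics: no interpretation needed.
print(orders.groupby("customer")["order_volume"].sum())
```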
We think that embeddings are perfect for this kind of task, because all of the mental models I just talked about show how embeddings can be helpful when dealing with tabular data. First of all, embeddings can be used for composing neural networks, which means we can take our tabular data, embed it, and then feed it to downstream models in a way that makes those models much simpler, because much of the information has already been extracted from the data.
Embeddings can also be used for dimensionality reduction on tabular data. One of the consequences of dimensionality reduction is that you get to choose which information to keep and which to reject. Because of how embeddings are trained, they gain a degree of noise resilience, since they reject much of the information that constitutes noise. This means that through embedding you can deal with issues that are very common in tabular data, such as missing values, duplicate values, and de-normalized values; a good embedding will handle many of these things without you having to worry about them. An embedding can also be used as a search index for retrieving the most similar items, which is perhaps the most straightforward and obvious application of embedding tabular data.
And then finally, embeddings can be used as a trainable abstraction for tabular data. And I'm particularly excited about this use case because, in effect, it can be used to join data across tables. And we at Featrix like to call it auto-join. And so this is why Featrix was created.
Featrix is the first commercial service for training embeddings for structured data.
Training your own custom embedding is something that's very difficult to do. Like I said, embeddings exist for most common data types, but custom multi-modal data and tabular data can be very tricky to embed correctly, so that you get all the benefits of an embedding and don't just end up with a random collection of vectors. So far we've created an API that allows you to upload a dataset and get access to the embedding space that we train for you, so that you can embed partial or entire records, train a downstream model based on that embedding space, and so on.
Featrix in Action
So now I'd like to take you through an example of what creating embeddings for tabular data sets can do for you. This is a simple toy data set that we use to show the basic idea, but we've done the same thing on data sets that contain dozens of columns and hundreds of thousands of rows.
So here, imagine you have a data set of heights and weights of different pets, cats and dogs to be specific. Let's say you work at a veterinary clinic and you collect the height and weight of the pets that come to your clinic. But unfortunately, you gave the task of collecting this data to two interns who maybe didn't talk to one another, and they ended up collecting data that's not quite what you asked for. Specifically, no single record in this data set contains all three pieces of information that we'd like to have for every animal. Half the data has the pet type and height but not the weight, and half has height and weight but not the pet type.
This is a situation that's very common when dealing with tabular data. Data sets often come from different sources and were created without communication, so they might not always contain the data you would need to join them. Here, for example, the height is a scalar, which means that if there's any noise in the data, as there is here, you would not be able to join these tables with a traditional or standard SQL join, because the values just don't match up. You'd have to use some other method to try and stitch these two tables together.
With Featrix auto-join, this becomes much simpler, because all you need to do is send the data to our API, and we will create an embedding space, as you will see in a second, that has the properties you would expect of a data set with the missing data filled in. To show exactly what I mean by this, consider that we create the embedding space for the data and then pull out values from each column. We embed them, and then we look at how similar the embeddings for the individual values in one column are to those for the individual values in another column.
Here, yellow indicates that two embeddings are very similar, and blue indicates that they are dissimilar. So when we plot the similarity of the embeddings for the type of a pet against the height of the pet, we see that shorter animals are much more associated with cats than with dogs, whereas taller animals are much more associated with dogs than with cats. Now, this image is not quite a probability distribution; it is just the vector similarity between the embedding for cat, in this case, and the embedding for 18 inches, and we see that these two embeddings are very similar. We also see that the embedding has non-trivial structure that is related to the information contained in the data set.
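A sketch of how this kind of plot can be produced in general (the embedding vectors below are random placeholders, and this is not the Featrix API itself): embed each distinct value from the two columns, then take all pairwise cosine similarities.

```python
import numpy as np
import matplotlib.pyplot as plt

def normalize(m):
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Placeholder embeddings for the distinct values of two columns,
# e.g. 2 pet types and 30 height values, each as a 16-dimensional vector.
pet_type_vectors = normalize(np.random.rand(2, 16))
height_vectors = normalize(np.random.rand(30, 16))

# Pairwise cosine similarity: rows = pet types, columns = heights.
similarity = pet_type_vectors @ height_vectors.T

plt.imshow(similarity, cmap="viridis")   # yellow = very similar, dark blue = dissimilar
plt.colorbar()
plt.show()
```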
When we create a similar plot for height versus weight, where the rows correspond to the different heights a pet can have and the columns correspond to the different weights a pet can have in pounds, we see that the two are related in the way we would expect: taller animals are heavier. We also see that the heavier, taller animals are much more similar to each other than they are to the shorter, lighter animals. The most important element comes when we compare the embeddings for the type of the pet and its weight. This direct comparison would not be possible in the original data set, because, as you recall, the type of the pet and its weight were never in a single row together.
However, because we are able to train an embedding whose structure reflects the overall relationships in the data, we see that in the embedding space, we recover the sort of relationship between the weight of the pet and its type that we would expect to see, namely that heavier pets are much more associated with dogs than with cats. And this is the basic capability that auto-join gives you. You don't have to worry about joining the data directly.
So once we have that embedding, we can train interesting models on it. Here is a snippet of code that produces a new downstream model based on that embedding, and we give it data that contains the heights and types of pets. So the training data for this model does not include the weight. However, because we now have this joint embedding, we can query the model with information about the weight of the pet. Here you see we query it with weights from 10 to 45 pounds, and we can actually run predictions on this model, even though the model was not trained on the weight information. If you recall the earlier slide about how image generation models are trained these days, this is a very similar idea: you train the model based on the association between height and weight, but then when it comes to prediction, you can query the model with the weight of the animal. And here we see that as you go from small weights to high weights, the prediction goes from being very certain that the animal is a cat to being very certain that it is a dog.
Again, this is just a simple toy dataset, but it showcases the ability to learn non-trivial structure from tabular data, which can be helpful for recovering relationships across tables that may not exist in any single table.
Embeddings: An Emerging Superpower for Developers
So to sum up, we believe that embeddings are an emerging superpower for developers. They are responsible for the scope of computation expanding very significantly, and we're very excited about all the new types and modes of computation that we will be able to do in the near future. Vector embeddings are the lingua franca of AI: any AI system requires embeddings to operate, and embeddings constitute its lifeblood.
Embeddings can supercharge your applications with AI. If you have a really well-trained embedding space for your data, you can perform computations that are much more complicated and much more valuable than what you could do on the native representation of that data as it was captured and stored.
Finally, off-the-shelf embeddings are available for most common data types. If you're dealing with text, images, or audio, there are models out there that you can use to embed your data today. If you're dealing with tabular data sets, data sets with mixed modalities, or data that otherwise doesn't fit the models published online or open sourced, you'll have to train your own embedding, or you can give Featrix.ai a shot.
And so finally, a little joke: embedding is, unfortunately, a word with several meanings, so it can sometimes be a bit confusing. "Embedding" can refer to the function that performs the embedding, to its result, and to the process of actually applying that function.
Embedding: the process of creating an embedding by embedding.
And so hopefully that was not too confusing for you. If you'd like to learn more about what we do at Featrix.ai, or if you'd like to learn more about embeddings, please visit us at Featrix.ai.
If you have any questions about embeddings or our offering, you're free to reach out to me directly at pawel@featrix.ai.