
# Making AI Predictions Across Data Sets with Featrix

After talking with many people about what Featrix can do, we realized that a quick, basic demo would be the best way to show the essence of its capabilities. This minimal example involves just two data sets and three columns, a mere fraction of what Featrix can do. To see some meatier examples, check out our other demos.

## Automatically Joining AI Data Sets with Featrix

When we're building machine learning models, we often want to bring data together from different sources: different tables, different CSV files, or even completely different databases. Typically, information from these different sources is encoded slightly differently.

Bringing together these sources is often challenging: it tends to involve manual scripts, SQL trial and error, and tedious debugging. The work we are doing with Featrix not only makes this much easier, but improves model accuracy as well.

Much has been written about improving model accuracy, and one of the leading ways is to bring new data to a model. Building AI training data sets from multiple sources is a powerful way to improve your model. This makes intuitive sense: adding more data gives the model more context to draw from.

In this demo, we show how to use Featrix to automatically join data from multiple sources so that you can make predictions that span the data sources without having to join the data yourself.

In other words, Featrix provides the ability to make a prediction on a variable in one source based on data from another source!

This is powerful. Consider data from two different applications or departments. For example, data may originate from sales and marketing, and each data source has its own representations for business data, which makes it expensive to join the data manually. But your boss probably doesn’t care about that: You need results.

We’re building an easy button for this kind of task.

In our toy example, we've got data about the heights and weights of cats and dogs. As shown in the plot below, the combination of heights and weights shows no discernible clustering. The data is mixed.

Predicting whether an animal is a cat or a dog requires knowing something about *both* the height and the weight.

In our first data set, we have height and weight data. Note that there's nothing about what type of animal is involved:

In our second data set, we have data for the heights and type of animal:

Note that *pet_type* is either “Dog” or “Cat”.
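To make the shape of the two data sets concrete, here is a minimal sketch in plain Python. The rows are made-up placeholders rather than the demo's actual data, and the `columns` helper is hypothetical; only the column names *height*, *weight*, and *pet_type* come from the description above.

```python
# Illustrative rows only: these values are invented, not the demo's data.
# First data set: height and weight, with no animal type.
heights_weights = [
    {"height": 9.5, "weight": 10.0},
    {"height": 18.0, "weight": 24.0},
    {"height": 24.0, "weight": 45.0},
]

# Second data set: height and animal type, with no weight.
heights_pet_types = [
    {"height": 9.0, "pet_type": "Cat"},
    {"height": 19.5, "pet_type": "Cat"},
    {"height": 23.0, "pet_type": "Dog"},
]


def columns(rows):
    """Return the sorted column names used across all rows."""
    return sorted({name for row in rows for name in row})


# The only column the two data sets have in common.
shared = sorted(set(columns(heights_weights)) & set(columns(heights_pet_types)))
print(shared)  # ['height']
```

The point of the sketch is that *height* is the only column the two sources share, so it is the only bridge between *weight* and *pet_type*.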

## Training the Embedding Space

We're going to bring these two data frames together into a single vector space. To do this, we will train a network that represents the vector space on these two data sets. Generally, we would split our available data into train and test sets so that we could validate the results. We do the split here, but for the sake of brevity, we aren't going to use the test data in this demo.

Now we've got embeddings in this vector space for all three columns: *pet_type*, *height*, and *weight*.
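The train/test split mentioned here can be sketched in a few lines of standard-library Python. This is a generic illustration, not Featrix's own splitting code: the function name `train_test_split`, the 20% test fraction, and the placeholder rows are all assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder rows standing in for one of the data sets.
rows = [{"height": float(h)} for h in range(10, 30)]
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Each data set would be split this way on its own, with the held-out rows set aside for validation later.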

## Visualizing the Embedding Space

We can look at the relationship between the embeddings for the type of pet and for the height. What we see here is that the embeddings for *Cat* and the embeddings for animals 20 inches and under are similar.

We also see a little bit of fuzziness when an animal is 20 to 21 inches tall, where the embeddings start to be more similar to *Dog*.

We can do this same comparison between the embeddings for the weights and the pet type as well. Let’s take a look:

Under the hood, our function *PlotEmbeddings* samples the domain of the variables specified, and then it constructs the embeddings for just those variables.

Here we specify *pet_type* and *height*, so we construct embeddings for just the set of *pet_type* values and just the *height* values. Next, we take these two sets of vectors and multiply the first by the transpose of the second. With each vector scaled to unit length, that product is the matrix of cosine similarities, which is what these plots show. The structure of these embeddings comes from the training of the vector space we did above, and it is that structure that gives the embeddings their semantics and lets us compare them in a meaningful way.
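As a rough illustration of that similarity computation, the sketch below builds a cosine similarity matrix from two small sets of vectors in plain Python. It is not the *PlotEmbeddings* implementation: the helper names and the two-dimensional example vectors are made up for the illustration, and real embeddings would have many more dimensions.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity_matrix(a_rows, b_rows):
    """Entry [i][j] is the cosine similarity of a_rows[i] and b_rows[j]."""
    a_unit = [normalize(v) for v in a_rows]
    b_unit = [normalize(v) for v in b_rows]
    # Dot products of unit vectors: the same as A times B-transpose.
    return [[sum(x * y for x, y in zip(a, b)) for b in b_unit] for a in a_unit]


# Made-up stand-ins for the pet_type and height embeddings.
pet_type_vectors = [[1.0, 0.0], [0.0, 1.0]]
height_vectors = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]

similarity = cosine_similarity_matrix(pet_type_vectors, height_vectors)
for row in similarity:
    print([round(value, 3) for value in row])
```

With NumPy arrays, the same product would be written `a_unit @ b_unit.T`. The normalization step matters: without it, the product gives raw dot products, which also reflect vector length.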

At this point, we've traversed across the two data sets. We have height and weight from one data set and the pet type and height in the other data set.

We have our embedding space and now we can build a simple model that leverages the structure of this space.

Our model will predict the type of animal from the weight. The *pet_type* is our target column, and we train the model on the data set that contains the *pet_type* and *height* – but *the model is not trained on the weight data at all*.

This is a simple neural network: it has just two layers and it trains quickly.
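For readers who want a picture of what a two-layer classifier on top of an embedding looks like, here is a minimal forward-pass sketch in plain Python. It is a generic illustration under stated assumptions, not Featrix's actual architecture: the layer sizes, the ReLU and softmax choices, the parameter values, and the input vector are all invented, and training is omitted.

```python
import math


def dense(weights, bias, inputs):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(embedding, params):
    """Two layers: embedding -> hidden (ReLU) -> class probabilities."""
    hidden = relu(dense(params["w1"], params["b1"], embedding))
    return softmax(dense(params["w2"], params["b2"], hidden))


# Made-up parameters; a trained model would learn these values.
params = {
    "w1": [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]],
    "b1": [0.0, 0.1],
    "w2": [[1.0, -1.0], [-1.0, 1.0]],
    "b2": [0.0, 0.0],
}
labels = ["Cat", "Dog"]
probabilities = predict_proba([0.2, 0.7, -0.1], params)
print(dict(zip(labels, probabilities)))
```

The model's input is an embedding vector rather than a raw column value, which is why it can later be queried with a weight even though it was trained on rows containing only *pet_type* and *height*.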

Now we can use the model to make predictions!

To test the model, let’s ask the model about weights from 10 to 50 lbs at 5 lb increments. First, we set up the list of queries we want to make. We can run multiple questions at once with multiple variables or just a single variable. Here we just specify the weight.
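The list of queries described here could be built as follows. Representing each query as a dictionary keyed by column name is an assumption made for this sketch; the exact format Featrix expects is not shown in this post.

```python
# One query per weight from 10 to 50 lbs in 5 lb steps.
# range() excludes its stop value, so 55 is used to include 50.
queries = [{"weight": weight} for weight in range(10, 55, 5)]

print(len(queries))             # 9
print(queries[0], queries[-1])  # {'weight': 10} {'weight': 50}
```

A multi-variable query would simply carry more keys per dictionary, for example both a weight and a height.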

Note that we could also go back and use our test data from the original data set.

Now we run the query and print out the results:

You can see that as we move up through the range of weights, we start predicting a higher chance of the animal being a dog.

## Power and Ease

What we’ve done is train a vector space on multiple data sets and then train a simple model to leverage the structure of that vector space to answer a question that spans multiple data sets. We didn’t need all the data to build the model, nor did we need a particularly complex model to leverage this power!

This simple demo shows just a sliver of the power of Featrix. Featrix supports joining data in this manner across more than just two data sets: we have done projects involving numerous sources of data spanning hundreds of columns. For beefier examples that you can tinker with, check out our live demos.

Also, be sure to check out our posts about adding new data sets and the challenges of data fusion.