What’s the predictive value of a new data set? If you augment an existing model with a new data set, how will it perform?
As developers working on AI, as data scientists, as machine learning engineers, these are difficult questions to answer. We spent nearly all summer of 2023 on the road talking with customers asking what their biggest challenges are with AI projects. The universal answer that came back is that it's hard to know the predictive value of a new data set at a glance; rather, data scientists often spend significant time processing and analyzing relationships in a data set and trying many experiments.
Nearly everything that sounds like an AI development task turns out to be a project, and a big part of the problem is that the data preparation is so costly. One customer said that even just a simple tool to help them understand which columns might be good features would be a significant productivity boost.
We want to turn AI development--ML projects--back into tasks by eliminating the data preparation work. We've developed Featrix specifically to help everyone work with data and predictive AI faster: everyone from developers with limited AI skills to the most elite data scientists can benefit from working with Featrix. Forget about feature engineering, data cleaning, data preparation, feature stores, and a dozen other integrations required for machine learning. You can still use those practices and tools with Featrix, but they are not necessary, and in some cases we actually advise against them.
These are tricky questions. In talking to ML leaders over the last few months, we’ve heard that answering these questions is a project. Teams have to coordinate on the data, and once the data is ready for work, it can take several hours to massage. All of this happens before any ML activities.
We include a powerful UI with Featrix, so you don't have to use our API, but the API exposes every part of working with Featrix. We are focused on AI for developers, so we have built our API as a first-class citizen that is easy to learn and use with confidence.
Let’s take a look at working with Featrix:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import featrixclient as ft
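# Split the data. `df` is assumed to be a pandas DataFrame that you
# have already loaded and that contains the target column.
df_train, df_test = train_test_split(df, test_size=0.25)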
# Connect to the Featrix server. This can be deployed on prem with Docker
# or Featrix’s public cloud.
featrix = ft.Featrix("http://embedding.featrix.com:8080")
# Here we create a new vector space and train it on the data.
vector_space_id = featrix.EZ_NewVectorSpace(df_train)
# We can create multiple models within a single vector space.
# This lets us re-use representations for different predictions
# without retraining the vector space.
# Note, too, that you could train the model on a different training
# set than the vector space, if you want to zero in on something
# for a specific model.
model_id = featrix.EZ_NewModel(vector_space_id,
                               "Target_column",
                               df_train)
# Run predictions
result = featrix.EZ_PredictionOnDataFrame(vector_space_id,
                                          model_id,
                                          "Target_column",
                                          df_test)
# Now result is a list of classifications in the same symbols
# as the target column
EZ_PredictionOnDataFrame implements what we call a ‘small’ network, which operates as a linear regression.
From this, we can now classify a set of data in our df_test dataframe. Featrix takes care of all the normalization, value imputation of null or missing values, transformations, and so on. You can just feed in raw tabular structures. Your df_test can be partial queries or complete rows.
This vast simplification means you can start evaluating the predictive power of a data set in the time between meetings. A junior data engineer can explore the data rapidly without oversight or heavy technical lifts from senior staff.
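One simple way to put a number on that predictive power is to compare the predictions with the held-out target values. The sketch below assumes the prediction call returns a plain list of labels aligned row-for-row with df_test, as the comment in the example above suggests. The `accuracy` helper and the toy data are illustrative and are not part of the Featrix client.

```python
import pandas as pd


def accuracy(predicted, actual):
    """Fraction of predictions that exactly match the held-out labels."""
    predicted = list(predicted)
    actual = list(actual)
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty test set")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Toy stand-ins (hypothetical data): in practice `actual` would be
# df_test["Target_column"] and `predicted` would be the list returned
# by the prediction call.
df_test_demo = pd.DataFrame({"Target_column": ["yes", "no", "yes", "no"]})
predicted_demo = ["yes", "no", "no", "no"]

score = accuracy(predicted_demo, df_test_demo["Target_column"])
print(f"accuracy: {score:.2f}")
```

Exact-match accuracy only makes sense for classification targets; for a numeric target you would swap in an error measure such as mean absolute error.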
Exciting, right? Want to learn more about how Featrix connects AI data sets? Sign up and start working with AI. We're providing $10 of credits on the house.
Q&A
-
How does Featrix handle the normalization and value imputation for missing values in the dataset?
There are two parts to this. First, when you decide to train a neural function on one or more data sets, we create an embedding space as a foundational model that represents the data. The embedding space provides what I like to call a contextual transformation of the data. On top of the embedding space, we can layer one or more neural functions--models. These models carry the predictive target and can be trained on the same data as the embedding space, or a subset involving just some of the columns.
New data can be encoded into embedding vectors with the embedding space model, and indeed this is what happens when it's time to run a model: we first encode the raw incoming data to the prediction with the embedding space and then pass it to the model, your neural function. Featrix lets you train as many neural functions as you want per embedding space, which is a huge savings in training and operation costs--as well as vastly simplifying the mechanics of the models.
We believe this approach turns models from "pets" into "cattle" and we're excited about the possibilities here as we work to enable AI for developers.
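To make the encode-then-predict flow concrete, here is a toy sketch of one shared encoder feeding several small models. It is not Featrix's implementation: a fixed random projection stands in for a trained embedding space, each "neural function" is reduced to a least-squares linear model, and every name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained embedding space: a fixed projection from
# 4 raw numeric features to an 8-dimensional embedding.
W_embed = rng.normal(size=(4, 8))


def encode(rows):
    """Encode raw rows into embedding vectors (shared by every model)."""
    return np.tanh(np.asarray(rows, dtype=float) @ W_embed)


def with_bias(embeddings):
    """Append a constant column so each model can learn an intercept."""
    return np.hstack([embeddings, np.ones((len(embeddings), 1))])


def fit_head(embeddings, targets):
    """Fit one small linear model on top of the embeddings."""
    weights, *_ = np.linalg.lstsq(with_bias(embeddings), targets, rcond=None)
    return weights


def predict(weights, rows):
    """Encode incoming raw rows first, then apply one model's weights."""
    return with_bias(encode(rows)) @ weights


raw = rng.normal(size=(200, 4))
embeddings = encode(raw)  # computed once, reused for both targets

# Two different prediction targets trained on the same embeddings.
head_a = fit_head(embeddings, raw[:, 0] + raw[:, 1])
head_b = fit_head(embeddings, raw[:, 2] - raw[:, 3])

new_rows = rng.normal(size=(3, 4))
pred_a = predict(head_a, new_rows)
pred_b = predict(head_b, new_rows)
```

The point of the design is visible in the last few lines: the expensive representation is built once, and adding another prediction target only costs another small head.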
-
What limitations, if any, does the 'small' network implemented by EZ_PredictionOnDataFrame have in terms of model complexity and predictive accuracy?
The sizing is there mostly to tune latency and costs. As we mature the product, Featrix will report tradeoffs in sizing. There is a little more predictive power in 'large' models but it's a minor detail in most uses of Featrix.
-
Can Featrix's embedding space and model training processes be customized or optimized for specific types of data or predictive tasks?
Under the hood, yes, there are knobs for experimenting with and customizing the behavior of the embedding space. The main knobs are which columns to ignore or keep and how many samples to consider when training, which can be controlled through the test/train split parameter.
Our goal is that you should never need to mess with these knobs, but we expose them so that folks who want to experiment with them can do so.
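The parameter names for these knobs are not shown in this post, so the sketch below approximates both of them on the client side with pandas before any data is sent for training. The `prepare_training_frame` helper, its arguments, and the column names are hypothetical.

```python
import pandas as pd


def prepare_training_frame(df, ignore_columns=(), max_rows=None, seed=0):
    """Drop ignored columns and optionally subsample rows before training."""
    # errors="ignore" lets the caller list columns that may be absent.
    trimmed = df.drop(columns=list(ignore_columns), errors="ignore")
    if max_rows is not None and len(trimmed) > max_rows:
        # A fixed seed keeps the subsample reproducible between runs.
        trimmed = trimmed.sample(n=max_rows, random_state=seed)
    return trimmed


# Hypothetical data set with an identifier column that carries no signal.
df_demo = pd.DataFrame({
    "customer_id": range(10),
    "age": [25, 31, 47, 52, 38, 29, 60, 44, 35, 41],
    "churned": ["no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes"],
})

df_small = prepare_training_frame(
    df_demo, ignore_columns=["customer_id"], max_rows=6
)
print(df_small.shape)
```

The trimmed frame can then be passed wherever the examples above use df_train.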
-
How does Featrix ensure data privacy and security, especially when dealing with sensitive or proprietary datasets in its cloud-based service?
For extremely private data, we offer the ability to run Featrix in your cloud or data center. In our public cloud, data is encrypted in transit, we use best practices for secrets management on our internal systems, and authentication for our application uses standard encrypted API keys and integrates with SuperTokens. We re-evaluate our security posture at least once a quarter and run automated scans on our dependencies to ensure we're up to date. We also offer advanced security in our public cloud upon request, which can include connecting to your storage so that data stays in your cloud and not ours.