
Get baseline ML performance on a new data set in just a few steps

What’s the predictive value of a new data set?  If you augment an existing model with a new data set, how will it perform?

These are tricky questions. In talking to ML leaders over the last few months, we’ve heard that answering them is a project in itself. Teams have to coordinate on the data, and once the data is ready to work with, it can take several hours to massage into shape. All of this happens before any ML activities.

With Featrix, we’ve realized that our approach lets you explore the predictive power of a data set rapidly.

Let’s take a look:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

import featrixclient as ft

# Load your tabular data (the path here is a placeholder), then split it.
df = pd.read_csv("your_data.csv")
df_train, df_test = train_test_split(df, test_size=0.25)

# Connect to the Featrix server. This can be deployed on prem with Docker
# or Featrix’s public cloud.
featrix = ft.Featrix("")  # pass your server URL here

# Here we create a new vector space and train it on the data.
vector_space_id = featrix.EZ_NewVectorSpace(df_train)

# We can create multiple models within a single vector space.
# This lets us re-use representations for different predictions
# without retraining the vector space.
# Note, too, that you could train the model on a different training
# set than the vector space, if you want to zero in on something
# for a specific model.
# The target column name and training frame below are assumptions;
# check the Featrix docs for the exact signature.
model_id = featrix.EZ_NewModel(vector_space_id, "target", df_train)

# Run predictions on the held-out rows (the model id and test frame
# arguments are assumptions about the call signature).
result = featrix.EZ_PredictionOnDataFrame(vector_space_id, model_id, df_test)

# Now result is a list of classifications in the same symbols
# as the target column

The EZ_PredictionOnDataFrame call runs what we call a ‘small’ network, which operates as a linear regression.
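Since the snippet above says the result is a list of classifications in the same symbols as the target column, getting a baseline accuracy number is a one-liner. Here is a minimal sketch; the `churned` column and the stand-in predictions are made up for illustration, standing in for the real `df_test` and `result`:

```python
import pandas as pd

# Hypothetical ground truth and predictions; in practice df_test comes
# from the earlier train/test split and result from EZ_PredictionOnDataFrame.
df_test = pd.DataFrame({"churned": ["yes", "no", "no", "yes"]})
result = ["yes", "no", "yes", "yes"]  # stand-in predictions

# Baseline accuracy: fraction of predictions matching the target column.
accuracy = sum(p == t for p, t in zip(result, df_test["churned"])) / len(df_test)
print(f"baseline accuracy: {accuracy:.2f}")  # → 0.75
```

That single number is often all you need to decide whether a data set is worth a deeper investment.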

From this, we can now classify a set of data in our df_test dataframe. Featrix takes care of all the normalization, imputation of null or missing values, transformations, and so on. You can just feed in raw tabular structures. Your df_test can be partial queries or complete rows.

This vast simplification means you can start evaluating the predictive power of a data set in the time between meetings. A junior data engineer can explore the data rapidly without oversight or heavy technical lifts from senior staff.

Exciting, right? Take our demos for a spin in Google Colab and see what you think!