
The Data Fusion Challenge

At Featrix, we want to make things easy. It should be easy to work with your data; it should be easy to train and explore new models; it should be easy to understand your data from a variety of machine learning and analytical perspectives.

Today, it’s not easy to do much with data beyond storing it.

Linking together data sets is traditionally difficult for a variety of reasons:

  • Data sets are noisy and sometimes the noise makes it unclear what the right linkage should be.
  • The unique keys to join records may span multiple columns in one or more of the data sets.
  • Linking is not always 1-to-1; sometimes it’s 1-to-n or n-to-m.
  • It’s difficult to examine the results of linking and evaluate if the result is what you wanted.
  • Data may be missing, whether values in just a few records or entire columns from one data set, which can prevent linking enough records to be useful.

Over the last few months, we have been working on a number of customer accounts that all require data fusion in some form. These customers have needed to link internal data with other internal data, internal data with public data, public data with public data, and internal data with data purchased from a vendor.

In all cases, our customers want to learn something from the data: some want to build tools to sell ML-powered solutions to their own customers; others want to analyze the fused data sets to better understand or enable their businesses.

Across industries we have seen the same problems come up over and over, particularly when dealing with address data and time-based data. 

One customer needed to link three public data sets to create a demo ML model for gauging risk on physical buildings in various cities. With Featrix, we encoded the disparate input data sets into a single multi-modal vector space, then enabled the customer to query that space and build downstream models for their own risk metrics. The project required no data cleaning or manual linking before the training data was ingested into Featrix.
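To make the shape of that workflow concrete, here is a minimal sketch written against ordinary Python tooling rather than the Featrix API: rows from several unlinked public data sets are mapped into one shared vector space, and a small downstream model is trained on those vectors for a risk score. The hashing encoder is a crude stand-in for Featrix’s learned multi-modal embedding, and the file names and the "risk" column are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import Ridge

def embed_rows(df: pd.DataFrame, n_dims: int = 256):
    """Stand-in embedding: hash each row's column=value pairs into a fixed-width vector."""
    hasher = FeatureHasher(n_features=n_dims, input_type="string")
    rows = (
        [f"{col}={val}" for col, val in record.items()]
        for record in df.astype(str).to_dict(orient="records")
    )
    return hasher.transform(rows).toarray()

# Three public data sets about the same buildings; no manual linking or cleaning.
permits = pd.read_csv("permits.csv")            # hypothetical file
inspections = pd.read_csv("inspections.csv")    # hypothetical file
assessments = pd.read_csv("assessments.csv")    # hypothetical file

# Map rows from every data set into the same vector space.
vectors = embed_rows(pd.concat([permits, inspections, assessments], ignore_index=True))

# Train a downstream model on the vectors for a customer-defined risk metric.
labeled = pd.read_csv("labeled_buildings.csv")  # hypothetical file with a "risk" column
risk_model = Ridge().fit(embed_rows(labeled.drop(columns=["risk"])), labeled["risk"])
```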

We have also been working with a marketing services company on a similar project to encode and enrich Salesforce data. They have a mix of internal data dumps from their Salesforce instance, data from their customers, and data they buy from various data vendors. With Featrix, our solutions engineer was able to encode these data sets and provide enriched information to the customer using Featrix’s built-in vector database tooling, including nearest neighbor searches and clustering. These features are used by first embedding the tabular data with Featrix and then calling the clustering and nearest neighbor search functions in the Featrix API.
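As an illustration of those two vector-space operations, the sketch below runs nearest-neighbor search and clustering over row embeddings with scikit-learn. It is not the Featrix API: `embed_rows` is the stand-in encoder from the previous sketch, and the file name, neighbor count, and cluster count are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

accounts = pd.read_csv("salesforce_accounts.csv")  # hypothetical export
vectors = embed_rows(accounts)                      # stand-in encoder from the previous sketch

# Nearest neighbors: for one account, find the most similar records across the
# fused data, e.g. to fill in fields that are sparse in the original record.
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(vectors)
distances, indices = nn.kneighbors(vectors[:1])
similar_accounts = accounts.iloc[indices[0]]

# Clustering: group accounts into segments for analysis or enrichment.
accounts["cluster"] = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(vectors)
```

Cosine distance is a common choice for comparing embeddings because it ignores vector magnitude; the cluster count would be tuned to the customer’s segmentation needs.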

As we continue to build out these features, we'll have more to share.