If you’ve heard about multimodal representations, it was probably in the context of text-and-image search. It’s all the rage. But I want to talk about a new kind of multimodal representation: one that’s all about the “soft linking” of data tables.
Data pipelines, data enrichment, all this stuff is fragile. Super fragile. Linking records tends to be an “all or nothing” proposition: everything we do in data preparation boils down to a binary outcome. Either the record is linked or it is not. Either the value is a number or it’s not. There’s no fuzziness, no allowance for slop; folks aggressively map awkward values to NaN and then call DataFrame.dropna() to forget about the whole thing.
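Here’s a minimal sketch of that “all or nothing” pattern in pandas. The table and values are made up, but the shape of the failure is the common one: anything the coercion can’t parse becomes NaN, and dropna() throws the whole row away.

```python
import pandas as pd

# Hypothetical raw table: "revenue" arrives in mixed formats.
df = pd.DataFrame({
    "account": ["a1", "a2", "a3", "a4"],
    "revenue": ["1200", "1,500", "$900", "2000"],
})

# The binary pipeline: coerce to numeric, then drop whatever didn't parse.
df["revenue_num"] = pd.to_numeric(df["revenue"], errors="coerce")
cleaned = df.dropna(subset=["revenue_num"])

# Half the rows are gone, even though "1,500" and "$900" clearly carry signal.
print(len(df), len(cleaned))  # 4 2
```

The values with commas and currency symbols weren’t noise; they were just written by a different app. The pipeline can’t tell the difference, so it discards them.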
But the problem is that we lose a lot of information when we do this. Just because a record isn’t completely populated doesn’t mean there’s no signal in it. On one project, we discovered that different applications had their own formats for certain fields in the database. Over time, the popularity of these apps shifted, so the representation of real-world entities shifted with them. Couple that drift with a pipeline that tosses “malformed” records, and eventually those “malformed” records become the dominant representation, which means you’re tossing a lot of useful data. Over the last six years of working with data organizations, I’ve found this problem rampant in every one of them.
Entity resolution and joining records are even more challenging: the relationships may be 1-to-1, 1-to-many, or many-to-many. What do you do when you hit a “many” scenario? You might duplicate the records, except that may well violate downstream assumptions, and you end up reporting incorrect values because, say, a sum() double-counts the duplicated rows. Or maybe you aggregate, collapsing the “many” into a summary that lives on the row, and lose the detail. Every one of these touchpoints requires a decision, and every decision has trade-offs that drain your engineering time and cause decision fatigue.
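Both horns of that dilemma are easy to reproduce. In this toy example (tables and values invented for illustration), duplicating rows on a 1-to-many join inflates a sum, while pre-aggregating keeps the sum honest but throws away ticket-level detail:

```python
import pandas as pd

# Hypothetical tables: one customer row, many support-ticket rows.
customers = pd.DataFrame({"cust_id": [1, 2], "balance": [100.0, 50.0]})
tickets = pd.DataFrame({"cust_id": [1, 1, 1, 2],
                        "ticket": ["t1", "t2", "t3", "t4"]})

# Option 1: duplicate. Customer 1's balance now appears three times,
# so a naive sum() over the joined table triple-counts it.
joined = customers.merge(tickets, on="cust_id")
print(joined["balance"].sum())  # 350.0, not the true 150.0

# Option 2: aggregate the "many" side first. The sum is right again,
# but all per-ticket information has been collapsed into a count.
summary = tickets.groupby("cust_id").size().rename("n_tickets").reset_index()
flat = customers.merge(summary, on="cust_id")
print(flat["balance"].sum())  # 150.0
```

Neither option is wrong in general; each is wrong for some downstream consumer, which is exactly why these decisions pile up.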
Can you see why data folks are so angry? Why is this stuff so hard? Sometimes it’s because we format data to be easy to work with for the use cases we imagine or expect at the time we create it. Other times, the data format is simply the most convenient representation to produce. Either way, there’s a disconnect between the representation the data is generated in and the representation it’s consumed in.
But it doesn’t actually matter, because the downstream use of the data is varied, dynamic, and growing in scope every day. Our boss asks for a new report. We want to leverage a new AI model. The customer wants a new aggregation. Whatever the reason, the schema we start with is rarely the one we want once the business hands us new requirements.
The ideal solution, the thing we all actually need, is a way to decouple the source data representation from the query representation. We want to associate data for enrichment and linkage in a way that is inherently associative and multimodal. Then we can query table B (the query representation) using things we know about table A (the source data representation).
This is the exciting part of how Featrix works with data: it can consume data from a variety of tables (pandas DataFrames, CSV files) and embed these different sources together into a single vector space.
Under the hood, Featrix indexes the vectors for efficient clustering and nearest-neighbor queries. Then it trains new downstream models on the data.
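To make the “soft linking” idea concrete, here’s a toy sketch of querying one table with a record from another via nearest neighbors in a shared vector space. To be clear, this is not Featrix’s API or implementation; the embeddings below are random placeholders standing in for whatever a real embedding model would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
table_a = rng.normal(size=(5, 8))   # 5 embedded records from table A (fabricated)
table_b = rng.normal(size=(7, 8))   # 7 embedded records from table B (fabricated)

def normalize(x):
    # Unit-length rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

a, b = normalize(table_a), normalize(table_b)

# Query table B using a record from table A: score every B row by cosine
# similarity, then take the best match. The link is "soft" because we get
# a ranked list with scores, not a binary matched/unmatched verdict.
query = a[0]
scores = b @ query
best = int(np.argmax(scores))
print(f"A[0] soft-links to B[{best}] with similarity {scores[best]:.3f}")
```

The point of the sketch is the shape of the operation: because both tables live in one space, linkage degrades gracefully into a similarity score instead of the all-or-nothing join decisions described above.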
And this is all automatic (with knobs for fine tuning).
It’s hugely powerful. As an experiment, we gave Featrix to a group of experienced ML engineers. A project that had taken them weeks to build with traditional (manual) methods, they finished with Featrix in just a few hours.
In fact, we’ve made it so easy to add new sources of data into AI models that we casually and confidently try new things while we’re on demo calls with customers.
You’re gonna want to check this out. Sign up below and we'll get in touch.