Vector embeddings have emerged as a powerful tool in artificial intelligence, particularly in information retrieval and generative AI systems. Traditionally, embeddings have been used to encode unstructured data, enabling efficient retrieval and serving as a common representation in generative AI models. This progress allowed enterprises to finally extract insight and value from the unstructured data they had accumulated in their data warehouses since the onset of digitization, by now ten times the volume of structured data.
This blog explains why embeddings are also useful for structured data - not just unstructured data. Building predictive models on top of embeddings avoids the iterative model training and optimization process that has been the bane of machine learning for decades and the main barrier to its wider adoption. AutoML reduced the effort involved, but did not bring widespread adoption in industries and market segments where data science expertise is scarce. Featrix delivers predictive models on embeddings of structured data, thus democratizing access to predictive analytics.
Embeddings as the "lingua franca" of AI systems
Vector embeddings are the secret sauce of vector databases. Beyond transforming keyword-based search into semantic search, embeddings have emerged as a universal language, a lingua franca for AI systems: they let system layers and modules exchange information in one shared representation. Embeddings capture the semantic meaning and contextual information of raw data, be it text, images, or other forms of unstructured information - and, as we’ll explain in more detail later, even structured data! What makes embeddings particularly powerful is that they allow AI systems to process information by meaning, in a more human-like manner, rather than by surface form.
Architecturally, embeddings serve as the interface between different modules in AI systems, and bridge the gap between input data and the underlying algorithms. In a way, embeddings open up data for computational processing. For instance, in natural language processing (NLP), embeddings act as the intermediary between encoders, decoders, transformers, and other neural networks or classic AI models. When processing text data, words or phrases are transformed into dense vectors through embedding layers, enabling subsequent modules to apply additional computation using the same representation.
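To make the idea of an embedding layer concrete, here is a minimal Python sketch. It assumes a tiny vocabulary and random vectors (a real model learns them during training), and the names `build_embedding_table` and `embed_sentence` are made up for illustration. It shows words being looked up as dense vectors and averaged into one sentence vector that downstream modules could consume.

```python
import random

EMBEDDING_DIM = 4  # tiny on purpose; real models use hundreds of dimensions


def build_embedding_table(vocabulary, dim=EMBEDDING_DIM, seed=0):
    """Assign each word a dense vector (random here; learned in a real model)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocabulary}


def embed_sentence(sentence, table):
    """Look up each known word and average the vectors (mean pooling)."""
    dim = len(next(iter(table.values())))
    vectors = [table[word] for word in sentence.lower().split() if word in table]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


vocabulary = ["customers", "cancel", "service", "revenue", "grows"]
table = build_embedding_table(vocabulary)
sentence_vector = embed_sentence("Customers cancel service", table)
print(len(sentence_vector))  # the sentence is now one fixed-size vector
```

Whatever comes next (a classifier, a retriever, a decoder) only needs to know the vector size, not where the text came from.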
One of the key advantages of using embeddings as the lingua franca of AI systems is their versatility and interoperability across different tasks and domains, including multiple languages! Whether it's sentiment analysis, machine translation, question answering, or text summarization, embeddings provide a common representation framework that can be easily integrated into various architectures and algorithms.
Furthermore, embeddings facilitate transfer learning, allowing AI models to leverage pre-trained embeddings from large-scale datasets and fine-tune them for specific tasks or domains with limited labeled data. This transfer of knowledge accelerates model convergence and improves performance, especially in scenarios where annotated data is scarce or expensive to obtain.
In summary, embeddings have become the linchpin of modern AI systems, serving as the foundational building block that enables seamless communication and collaboration between different modules. As AI continues to advance, embeddings are likely to play an increasingly pivotal role in shaping the future of intelligent systems.
Embeddings for Structured Data
Embeddings can help us break through computational barriers. You may think we overcame them for structured data a long time ago, but that’s not the case if you look more closely. We’ll explain what we mean by computational barriers first in the context of natural language processing, and later extend it to other data types, including structured data, where this gets even more interesting!
In the context of processing language, the ability of embeddings to capture and leverage the semantic meaning and context explains how vector search overcomes the limitations of traditional search. Unlike traditional approaches that rely on exact keyword matches and syntactic similarities, and therefore are unable to find relevant content that doesn't share the same keywords, semantic search built on top of vector embeddings retrieves content based on semantic similarity. Similarly, in image generation, traditional methods struggle with reusing parts of an image to create new ones. Modern generative image models leverage embeddings, enable users to describe in natural language how they want to reuse objects, and facilitate the creation of entirely new images with unprecedented ease and flexibility.
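The difference between keyword search and semantic search can be illustrated with a small Python sketch. The vectors below are hand-made toy values standing in for the output of an embedding model, and the function names are hypothetical. The query shares no words with any document title, so keyword matching finds nothing, while cosine similarity over the vectors still surfaces the relevant document.

```python
import math

# Hand-made toy vectors; a real system would get these from an embedding model.
DOCUMENTS = {
    "How to end your subscription": [0.9, 0.1, 0.0],
    "Quarterly revenue report": [0.0, 0.2, 0.9],
    "Resetting a forgotten password": [0.1, 0.9, 0.1],
}
QUERY_TEXT = "cancel my plan"
QUERY_VECTOR = [0.8, 0.2, 0.1]  # assumed embedding of QUERY_TEXT


def keyword_search(query, documents):
    """Return titles that share at least one exact word with the query."""
    query_words = set(query.lower().split())
    return [title for title in documents if query_words & set(title.lower().split())]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vector, documents):
    """Return the title whose vector is most similar to the query vector."""
    return max(documents, key=lambda title: cosine_similarity(query_vector, documents[title]))


print(keyword_search(QUERY_TEXT, DOCUMENTS))     # no shared keywords
print(semantic_search(QUERY_VECTOR, DOCUMENTS))  # closest document by meaning
```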
This gets even more interesting when you look at what embeddings can do for structured data. The truth is that many of the genuinely interesting questions about structured data are hard to answer with traditional analytics. For example, while compiling trends in transactional data every which way is easy, trends don’t answer most of the questions analysts are after, such as what causes a particular trend: why did revenue flatten out, or what factors make customers consider canceling the service you’re providing? Using generative models to turn structured data into embeddings opens up these types of analysis to more automation. To catch up on what embeddings are, and how they can be extended beyond unstructured data, watch this (27 min) video.
Predictive Models on Structured Data
For the past two decades, companies have relied on "classic" machine learning. While these models have enabled many compelling applications, including automating alerts in machinery or medicine, and controlling robots, the process of building performant models requires significant effort at the project level - not a task your typical developer can bang out in an afternoon.
Challenges with Current Approaches
Let’s understand why current approaches to building predictive models on structured data haven’t removed the barriers to their widespread adoption.
Classic ML: Traditional machine learning methods still demand substantial effort, with up to 80% of the time spent on data cleaning and preparation. The algorithms that generally perform best, and are therefore popular, include random forests, support vector machines (SVMs), shallow neural networks, and boosted trees. Although AutoML has reduced the effort, primarily in model training and optimization, data preparation, deployment, and ongoing monitoring remain challenges.
Pretrained Models: Public repositories like Hugging Face offer a plethora of pretrained models, and recently also "classic" models for classification and regression tasks. However, these models will perform worse on your task unless you “adapt” them to it, and adapting models requires abundant training data - not billions of samples, as required for pre-training an LLM, but still thousands of “labeled” (annotated) samples.
LLMs: Initially, LLMs struggled with numeric data, which suggested limited applicability in structured data contexts. Nevertheless, recent advancements have greatly improved the ability of LLMs to handle structured data. A notable approach involves serializing structured data into text and fine-tuning LLMs for predictive tasks.
Despite their versatility, LLMs face challenges in fine-tuning for structured data tasks similar to those of classic machine learning. The process requires sufficient training data (in the tens of thousands of samples) and expertise in applying complex techniques like parameter-efficient fine-tuning (PEFT) and in optimizing the LLM.
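The serialization step mentioned above can be sketched in a few lines of Python. This is only one plausible way to turn a table row into text; the "column is value" template, the function names, and the example columns are assumptions for illustration, not a format prescribed by any particular LLM.

```python
def serialize_row(row, target_column=None):
    """Turn one table row (a dict) into sentence-like text, hiding the target."""
    parts = [
        f"{column} is {value}"
        for column, value in row.items()
        if column != target_column
    ]
    return ". ".join(parts) + "."


def build_prompt(row, target_column, question):
    """Combine the serialized features with the question the LLM should answer."""
    return f"{serialize_row(row, target_column)} {question}"


# Hypothetical customer record; "churned" is the label we want to predict.
customer = {"plan": "premium", "monthly_fee": 49.0, "support_tickets": 3, "churned": "yes"}

text = serialize_row(customer, target_column="churned")
prompt = build_prompt(customer, "churned", "Will this customer churn?")
print(prompt)
```

Pairs of such prompts and their known labels would then form the fine-tuning set, which is where the tens of thousands of samples come in.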
A Novel Approach: Embedding Structured Data
Embeddings can also simplify the process of obtaining predictive models for structured data. At its core, this approach mimics what has proven so successful for unstructured data. Once the structure of the data is represented in embeddings, generating predictive models becomes much easier as well.
Featrix adopted this approach and delivers performant predictive models in two steps:
- Data Evaluation: Featrix begins by assessing the inherent structure of the data to determine its suitability for modeling.
- Predictor Construction: Once deemed suitable, Featrix constructs predictors based on embeddings as “neural functions”, bypassing the need for extensive optimization.
Featrix reduces the process of obtaining predictive models from an involved project to a task that any developer can handle. API calls simplify model deployment and integration.
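To give a feel for the two steps, here is a minimal Python sketch. It is not Featrix's algorithm or API: the encoder `embed_row` is a hand-written stand-in for a learned embedding, the nearest-centroid predictor is a deliberately simple substitute for a "neural function", and every name, column, and threshold is hypothetical.

```python
import math


def embed_row(row):
    """Hypothetical stand-in encoder: scale the numbers, one-hot the plan."""
    return [
        row["monthly_fee"] / 100.0,
        row["support_tickets"] / 10.0,
        1.0 if row["plan"] == "premium" else 0.0,
    ]


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_centroids(rows, labels):
    """Embed labeled rows and keep one centroid per label."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(embed_row(row))
    return {label: centroid(vectors) for label, vectors in grouped.items()}


def has_structure(centroids, min_gap=0.1):
    """Step 1 (crude check): are all class centroids at least min_gap apart?"""
    points = list(centroids.values())
    return all(
        distance(a, b) >= min_gap
        for i, a in enumerate(points)
        for b in points[i + 1:]
    )


def predict(row, centroids):
    """Step 2: predict the label of the nearest centroid in embedding space."""
    vector = embed_row(row)
    return min(centroids, key=lambda label: distance(vector, centroids[label]))


rows = [
    {"plan": "basic", "monthly_fee": 20.0, "support_tickets": 8},
    {"plan": "basic", "monthly_fee": 25.0, "support_tickets": 7},
    {"plan": "premium", "monthly_fee": 60.0, "support_tickets": 1},
    {"plan": "premium", "monthly_fee": 55.0, "support_tickets": 0},
]
labels = ["churn", "churn", "stay", "stay"]

centroids = fit_centroids(rows, labels)
if has_structure(centroids):
    new_customer = {"plan": "basic", "monthly_fee": 22.0, "support_tickets": 6}
    print(predict(new_customer, centroids))
```

The point of the sketch is the shape of the workflow: once rows live in an embedding space, checking for structure and building a predictor both become simple geometric operations, with no iterative tuning loop.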
Why choose Featrix?
What advantages does Featrix have over alternative approaches to predictive modeling?
- Unlike traditional machine learning approaches, where teams typically spend 80% of their time on cleaning and preparing the data and significant AI expertise is required to obtain performant models, Featrix works on raw data and delivers performant models with a single API call.
- Unlike AutoML, which automates just the training phase of the model lifecycle, Featrix compresses the whole model development process into a single task and delivers a prediction endpoint that can be integrated into the application.
- Unlike pretrained models, whose performance degrades on your task, and LLMs, which require AI skills to tune, Featrix doesn’t require iterative optimization.
- Unlike proprietary off-the-shelf models that degrade in performance on your task, and are costly to deploy, Featrix lets you deploy predictive models with minimal upfront investment.
Our SDK is the foundation of Featrix. It transforms your structured data into an embedding space that reveals whether your data is suitable for applying AI. If there is sufficient structure, you obtain predictors that can be integrated into your application with a simple API call. Learn more on the Featrix Prism product page.
We have used it to build turn-key solutions for the common business challenges where we have seen the most interest so far: churn prediction with Retail Analytics, and prioritizing CRM contacts with Featrix Haystack.
Or - if you need more guidance - consider our Featrix Smart Start offering, which gets you started with initial results and a roadmap in less than 30 days!