Opening up Structured Data to Generative AI

The conversation about generative AI has focused on unstructured data: leading intelligent conversations, semantically searching video, images, and audio, helping you code or troubleshoot. Structured data, which after all still represents 20% of all data and is arguably "denser" by nature, is critical in many day-to-day business transactions and decisions: ordering products, financial transactions, and manufacturing processes, to name a few. Picture the type of data you see in CSV files, databases, and Snowflake and other data lakes full of structured information.

The data driving these processes is stored in numerous deployed business applications and databases, often alongside unstructured data like logs, customer service records, and sensor measurements. Decisions based on the structured data are typically made using long-standing machine learning approaches, including classifiers and regression models; the data needed to train these models is obtained by joining multiple data sources and labeling them manually. It's a tedious process, but it's not going away anytime soon; read this earlier blog for a discussion of the challenges associated with data fusion.

There’s still a lot of value in structured data

Structured data, while smaller in volume, carries a lot of critical information in today's business world. Examples include transactional data and system measurements.

You can learn more about structured versus unstructured data, and the traditional approaches to analyzing them, from this educational page, for example.

Often, unstructured data is layered within a structured datastore; examples include system logs alongside measurements, and product descriptions with their specs, quantity and cost.

While generative AI opened up a multitude of ways to query unstructured data and create new insights, deeper insight can be gained by mining unstructured and structured data jointly:

  • Global phenomena may be visible in unstructured data, like a broad narrative about climate change and its effects, while numeric trends in structured data enable predictions, like how much temperatures have changed in specific regions
  • In healthcare, diagnoses are typically based on patient observations together with lab results and treatment details, and a data-driven individualized approach enables personalized treatments
  • In finance, transaction records by themselves don't reveal customer intent, so advising on strategy needs to consider both the transactions and the unstructured context around them

Why LLMs aren’t suitable for structured data

Popular large language models like the GPT family will accept structured data, but they aren't specifically designed to make inferences from structured data the way they handle natural language. They excel at understanding, generating, and working with natural language text but struggle with structured formats like tables, databases, or spreadsheets. That's because their architecture was designed for sequences of words, and unstructured data represents the vast majority of their training data. The results you typically get when querying the GPT family about numerical data are very underwhelming.

Existing methods can process each individually - LLMs for unstructured data, standard database analytics and machine learning for structured data - but neither is capable of making inferences that draw on both. Creating a joint representation is difficult: the data sets come from different sources that aren't aligned and lack the shared keys and common values that are so critical for performing joins in traditional relational databases.
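To make the role of shared keys concrete, here is a minimal sketch of a traditional relational join in plain Python. The table names, columns, and values are hypothetical illustrations; the point is that when two sources lack a common key, an approach like this produces nothing.

```python
# Two hypothetical tables that happen to share a "customer_id" key.
customers = [
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]
orders = [
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 1, "total": 99.5},
    {"customer_id": 3, "total": 40.0},  # no matching customer: dropped by an inner join
]

def inner_join(left, right, key):
    """Join two lists of row-dicts on a shared key column."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in right:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined

result = inner_join(customers, orders, "customer_id")
# Only rows whose key exists in both tables survive. If the two sources
# had no shared key column at all, there would be nothing to index on.
```

Real-world data sets from independent sources rarely arrive with a column this clean to join on, which is exactly the gap the approaches below try to close.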

New approaches for analyzing structured data

Three approaches have emerged that open up structured data to the power of modern AI:

  1. Transform structured data into text form
  2. Adapt LLMs to structured data
  3. Embed structured data

Let's look at how each of these approaches works at a high level. You can think of the first as advanced prompt engineering. We are still using the LLMs you're all familiar with, but the structured data and its relationships are transformed into textual form. It's like writing stories from the structured input: pull all the facts into statements, alongside how the various dimensions are connected - the natural language equivalent of a join. Learn more about this approach, for example, in this blog.
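As a minimal sketch of this first approach, the snippet below serializes each row of a small table into a natural-language statement that can be prepended to an LLM prompt. The column names and values are hypothetical; a real pipeline would also serialize the relationships between tables.

```python
# Hypothetical inventory rows, as they might come from a CSV or database.
rows = [
    {"product": "Widget A", "quantity": 12, "unit_cost": 3.50, "supplier": "Acme"},
    {"product": "Widget B", "quantity": 5, "unit_cost": 7.25, "supplier": "Globex"},
]

def row_to_sentence(row):
    """Turn one structured row into a plain-English statement."""
    return (
        f"Product {row['product']} from supplier {row['supplier']} "
        f"has {row['quantity']} units in stock at ${row['unit_cost']:.2f} each."
    )

context = "\n".join(row_to_sentence(r) for r in rows)
# The serialized facts become ordinary prompt text for any LLM:
prompt = context + "\nQuestion: Which supplier offers the cheapest product?"
```

The appeal of this approach is that it needs no model changes at all; the cost is that large tables quickly exhaust the context window, and aggregate questions still depend on the LLM doing arithmetic over prose.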

Second, you can fine-tune a large language model using transfer learning techniques. By training on datasets that contain both structured and unstructured data, these models learn to handle structured information more effectively. In a way it's an extension of using LLMs to write database queries, and ChatGPT is already able to write SQL queries out of the box. However, just as LLMs sometimes hallucinate answers on question-answering tasks, automatically generated SQL queries sometimes do not align with user intent or the database's vocabulary. And applying transfer learning requires significant expertise with AI frameworks, curated training data, and time to iterate until you achieve satisfactory results.
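Because generated SQL can hallucinate columns or drift from user intent, one common safeguard is to execute the query against the real schema before trusting its results. The sketch below shows this with Python's built-in sqlite3 module; the table, data, and the "generated" query string are hypothetical stand-ins for an LLM's output.

```python
import sqlite3

# A tiny in-memory database standing in for the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 30.0)])

# Pretend this string came back from an LLM asked for "sales by region".
generated_sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"

def try_query(conn, sql):
    """Attempt the query; return (rows, error) instead of raising."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)

rows, err = try_query(conn, generated_sql)
# A hallucinated column fails fast here instead of silently misleading users:
bad_rows, bad_err = try_query(conn, "SELECT profit FROM sales")
```

Catching schema errors this way is cheap, though it cannot detect the subtler failure mode where the query is valid SQL but answers a different question than the user asked.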

Finally, you can generate embeddings on structured data, and apply the same techniques that have proven so powerful with unstructured data. That’s exactly what Featrix has been designed to support: 

  • Autojoin discovers how to link previously disparate data. Featrix lets you construct robust data representations that model a variety of relationships (from 1:1 to many-to-many) without bogging you down in details. Featrix also simplifies the handling of timestamps, dates, and time strings by automatically adding new columns that capture different representations of the time.
  • Embed structured data: create a vector space and encode your data as embeddings within it. You can train multiple downstream models within a single vector space, you can encode data from multiple sources into the same space, and you can further fine-tune the models for specific downstream tasks.
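To give a feel for what embedding structured data means, here is a toy illustration (not Featrix's actual method): rows from a small hypothetical table are encoded as vectors by one-hot encoding the categorical column and min-max scaling the numeric one, and rows are then compared by cosine similarity in that space.

```python
import math

# Hypothetical pet rows with one categorical and one numeric column.
rows = [
    {"type": "dog", "weight": 30.0},
    {"type": "dog", "weight": 25.0},
    {"type": "cat", "weight": 4.5},
]

types = sorted({r["type"] for r in rows})
weights = [r["weight"] for r in rows]
lo, hi = min(weights), max(weights)

def embed(row):
    """One-hot encode the type, min-max scale the weight, concatenate."""
    one_hot = [1.0 if row["type"] == t else 0.0 for t in types]
    scaled = (row["weight"] - lo) / (hi - lo)
    return one_hot + [scaled]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

vectors = [embed(r) for r in rows]
# The two dog rows land closer together in the space than any dog-cat pair.
```

Once rows from different sources live in one vector space like this, nearest-neighbor search, clustering, and downstream models can operate on them jointly, which is the property a learned embedding generalizes far beyond this hand-rolled encoding.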

In a recent blog, we walked through an example of inferring the relationship between a pet's type and its weight from data sets that contained either the height or the weight of each pet, but not both. By embedding the type-height and type-weight tables, Featrix inferred the relationship, essentially confirming the intuition that dogs are generally heavier than cats. If you're looking for the details, go back to the blog.

This situation is common when dealing with tabular data: data sets come from different sources and do not always contain the columns you would need to join them.


Ready to try this yourself? Go to the Featrix documentation, and contact us to discuss your use case with a Featrix expert.