
Simplify Data Exploration with Embeddings

Using tabular embeddings and AI to simplify Exploratory Data Analysis

Exploring your data - often the first step in building predictive models - is still notoriously difficult and labor intensive. In this blog, we'll discuss how embeddings that transformed processing of natural language can also simplify data exploration, launching us into the era of AI-powered data insights.

I'll start us off with a brief review of what Exploratory Data Analysis (or EDA for short) is. Then we'll look at why EDA has been difficult, and where AI can alleviate its challenges. The third section jumps into the heart of the matter - how embedding structured data makes not only predictive modeling easier, but also exploratory data analysis. We'll illustrate the approach with a case study from the sales and marketing domain, before giving you the key takeaways in the conclusion.

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is a crucial first step in any data-driven project. It's the process of diving into a dataset to understand its main characteristics, patterns, and peculiarities before applying more complex analytical techniques. Think of it as getting to know your data intimately before you start asking it tough questions.

In EDA, analysts use statistics and visualizations to understand the data. This process often involves summarizing key statistics, identifying trends and relationships between variables, detecting outliers or anomalies, and spotting any data quality issues. It's like being a detective, looking for clues and piecing together the puzzle of what your data represents. EDA helps generate hypotheses about the data that can be tested later using more advanced statistical methods.


For those familiar with traditional business analytics, you can think of EDA as a more comprehensive and systematic approach to the initial data review you might do before creating a report or dashboard. It goes beyond simple summary statistics to provide a deeper understanding of the data's structure and content, setting the stage for applying advanced analytics and machine learning.

In the context of predictive modeling using traditional ML, EDA can inform decisions about which machine learning models might be most appropriate, as well as the notoriously difficult steps of feature selection and engineering. EDA provides cues about which variables are most relevant to the problem at hand.

Exploratory Data Analysis: Challenges and How AI can Alleviate them

Exploratory Data Analysis (EDA) was pioneered by John Tukey in the 1970s, and it has been widely used and refined ever since. However, it still involves a lot of manual effort, because of the following challenges:

  1. Data Volume and Complexity
     Problem: Large datasets are overwhelming and time-consuming to analyze thoroughly, making it difficult to grasp holistic insights.
     How AI helps: Rapidly process and analyze vast data volumes, providing concise summaries that make complex datasets more digestible.

  2. Data Quality Issues
     Problem: Missing values, outliers, and inconsistencies can skew analysis and severely impact classic ML model performance.
     How AI helps: Automatically detect anomalies, missing values, and inconsistencies; generative AI can suggest potential corrections or imputations.

  3. Feature Selection
     Problem: Identifying the most effective variables for ML tasks is notoriously difficult, especially with high-dimensional data; you must balance overfitting (too many features) against missing insights (too few features).
     How AI helps: Implement automated feature selection and dimensionality reduction using techniques like principal component analysis, mutual information, or t-SNE.

  4. Visualization
     Problem: Choosing appropriate visualizations that reveal patterns without misleading interpretations requires skill and experience; high-dimensional data need dimension reduction for effective visualization.
     How AI helps: Suggest and create custom visualizations tailored to specific datasets and analysis goals, based on data characteristics.
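As a concrete illustration of the data quality checks in item 2, here is a minimal sketch, using pandas on a small hypothetical dataset, of how missing values, outliers, and inconsistent categories can be surfaced automatically:

```python
import pandas as pd

# Toy transactions with typical quality issues (hypothetical data).
df = pd.DataFrame({
    "amount": [10.0, 12.5, 11.0, None, 9.5, 500.0],   # missing value + outlier
    "country": ["US", "US", "us", "DE", "DE", "US"],  # inconsistent casing
})

# Missing values per column.
missing = df.isna().sum()

# Simple IQR-based outlier flag for a numeric column.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]

# Inconsistent category spellings surface when you normalize case.
inconsistent = df["country"].nunique() != df["country"].str.lower().nunique()

print(missing["amount"], len(outliers), inconsistent)
```

In a real pipeline these checks would run across all columns, and the flagged rows could be handed to a generative model to propose corrections or imputations.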


These steps are mainly for generating hypotheses about your data, so be careful not to see patterns that aren't really there. AI and data processing automation can dramatically alleviate these challenges, but they don't replace human expertise. Instead, they augment human capabilities, allowing analysts to work more efficiently and focus on higher-value tasks, including extracting actionable insight from the data and decision-making.

Striking a balance between leveraging AI capabilities and maintaining human oversight ensures the analysis remains relevant, ethical, and aligned with business goals. Organizations can gain a better understanding of their data by combining AI technology and human input. This can help them make smarter decisions and come up with creative solutions.

How Embeddings can Transform EDA

In a previous blog, I mentioned that embeddings can be used for both unstructured and structured data. Let’s consider the various ways embeddings can simplify exploratory data analysis.

Uncovering hidden relationships

Embeddings can reveal intricate relationships within your data that might not be apparent in the original feature space. By mapping high-dimensional data to a lower-dimensional space, embeddings group similar points together, making patterns like clusters, outliers, and trends easier to find: for example, discovering customer segments you didn't know existed, or identifying subtle correlations between seemingly unrelated variables.

Embeddings are particularly useful for dealing with high-cardinality categorical variables. Instead of one-hot encoding, which can lead to sparse, high-dimensional data, you can use embeddings to represent categories as dense vectors. This not only reduces dimensionality but also captures semantic relationships between categories. For example, in a product recommendation system, embeddings could capture the similarity between different product categories without explicit feature engineering.
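To make the dimensionality argument concrete, here is a minimal NumPy sketch. The category vectors below are random placeholders, so their similarities are arbitrary; a trained embedding model would place genuinely related categories close together:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for product categories. In practice
# these would be learned by an embedding model, not drawn at random.
rng = np.random.default_rng(0)
categories = ["mugs", "cups", "lamps", "desks", "chairs"]
emb = {c: rng.normal(size=4) for c in categories}

# One-hot encoding needs one dimension per category (sparse, grows with
# cardinality); the embedding stays at a fixed, small dimension.
one_hot_dim = len(categories)     # 5 here, 10,000 for 10,000 categories
embedding_dim = len(emb["mugs"])  # 4, regardless of cardinality

# Cosine similarity in the embedding space can reflect semantic closeness
# (meaningless for these random placeholders, meaningful once learned).
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(emb["mugs"], emb["cups"])
print(one_hot_dim, embedding_dim, round(sim, 3))
```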

Dimensionality reduction and Visualization

One of the biggest challenges in EDA is visualizing high-dimensional data. Embeddings address this by reducing the dimensionality of your data while preserving important relationships. You can project the embedding space to 2D or 3D visualizations of complex datasets using techniques like t-SNE (t-distributed Stochastic Neighbor Embedding). In scatter plots of the projected data, clusters of similar points and unusual outliers become easy to spot.
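The projection step can be sketched with scikit-learn's t-SNE. The 16-dimensional vectors below are synthetic stand-ins for learned embeddings, arranged as two blobs purely for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic "embedding" vectors standing in for learned tabular embeddings:
# two Gaussian blobs in a 16-dimensional space.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 16)),   # one cluster
    rng.normal(loc=5.0, scale=0.5, size=(50, 16)),   # another cluster
])

# Project the 16-d embedding space to 2-D for a scatter plot.
X_2d = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(X_2d.shape)
```

The resulting `X_2d` array is what you would hand to a plotting library, coloring each point by an original variable to look for structure.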

The embedding representation facilitates a more efficient initial data exploration. Operations like clustering, nearest neighbor search, and similarity comparisons can be faster and more meaningful in the embedding space compared to the original feature space. This significantly speeds up your EDA process, allowing you to iterate and test hypotheses more quickly.
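For instance, a nearest-neighbor lookup in the embedding space reduces to plain vector arithmetic. A minimal NumPy sketch on random placeholder embeddings:

```python
import numpy as np

# Toy embedding matrix: 1,000 rows (e.g. transactions) in an 8-d space.
rng = np.random.default_rng(1)
E = rng.normal(size=(1000, 8))

def nearest_neighbors(E, i, k=5):
    """Indices of the k rows closest (Euclidean) to row i, excluding itself."""
    d = np.linalg.norm(E - E[i], axis=1)
    order = np.argsort(d)
    return order[order != i][:k]

nn = nearest_neighbors(E, 0)
print(len(nn))
```

Because distances in a learned embedding space reflect overall similarity across all original columns, the returned neighbors are "similar transactions" without any hand-crafted similarity function.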

Facilitating feature selection and engineering

Embeddings can be a powerful tool for feature selection and engineering. By examining the embedding space, you can determine which original features have the biggest impact on the data's structure. This guides you in selecting the most relevant features for your models. The embedding vectors can be used as new features to capture complex relationships in a compact way. This helps make machine learning models more effective and efficient.
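As a sketch of that last point, embedding vectors can simply be concatenated with the original numeric columns and fed to an ordinary classifier. Everything here is synthetic and the column layout is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 200
emb = rng.normal(size=(n, 8))       # stand-in for learned embedding vectors
numeric = rng.normal(size=(n, 2))   # original numeric columns kept alongside

# Use the embedding dimensions as additional features.
X = np.hstack([numeric, emb])
y = (emb[:, 0] + numeric[:, 0] > 0).astype(int)  # synthetic target

model = LogisticRegression(max_iter=1000).fit(X, y)
print(round(model.score(X, y), 2))
```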

Customer Story: Exploring Sales Data

In this section we’ll bring this to life using a customer’s use case. For context, we were asked to help extract insight from a large dataset of retail transactions: millions of sales spanning a decade. Using Featrix, we embedded individual order transactions that were characterized by typical variables like country, product name, description, number of orders, and amount spent. Unlike with traditional EDA, we didn’t have to spend hours or days wrangling the data into shape; the raw transactions served directly as the input to Featrix.

For the visualizations, we projected the embedding space to two dimensions using t-SNE. Then, we colored the points based on original variables that we kept alongside the embeddings. The most popular products are shown in distinct colors; everything else is shown in black.

Here are some of the insights you can glean, knowing that the members of a cluster are related to each other. In the chart below, color represents the product: clearly identifiable clusters correspond to specific products, while the amorphous regions are the “pile” of everything else, and orders increase toward the right. That wasn’t intuitive, given that there were thousands of different products and customers.

[Image: EDA-Case-BasePlot-dWM]

In the next plot, color represents the number of orders placed by a specific customer. There are two main groups in the chart. The yellow cluster on the left shows items ordered by customers who placed only two orders. The further right a dot sits, the more orders that customer placed over their lifetime, and thus the more valuable they are. This sort of plot would allow a marketer to develop a customer segmentation model and target high-value customers. With traditional EDA, you would need to group the sales transaction data by the number of orders per customer to arrive at the same plot.
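That traditional grouping step might look like this in pandas, on a toy transaction table with hypothetical column names:

```python
import pandas as pd

# Toy sales transactions (columns loosely mirroring the case study).
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_id":    [10, 11, 20, 30, 31, 32],
    "amount":      [9.0, 12.0, 5.0, 7.0, 8.0, 30.0],
})

# Orders per customer -- the grouping you'd need before plotting segments.
orders_per_customer = tx.groupby("customer_id")["order_id"].nunique()

# Attach the count back onto each transaction so it can color a scatter plot.
tx["n_orders"] = tx["customer_id"].map(orders_per_customer)
print(orders_per_customer.to_dict())
```

With embeddings, this aggregation emerged in the projection without being computed explicitly.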

[Image: EDA-Case-DirectionOrders-dWM]

This third plot reveals that different clusters actually have different meanings; it’s not just the number of orders. Here we colored by the country of residence of the customers. Most of the data has no correlation with the country, with a couple of exceptions where the order value is anomalous ($0) for a subset of orders from a few countries. Note: the gray bar hides sensitive specifics of the dataset.

[Image: EDA-Case-Country-dWM]

Conceivably, you could get similar findings using a more traditional approach. However, that means first spending significant time preparing your data, followed by running a standard clustering algorithm. What’s remarkable about the embedding approach is:

  • The transformation is learned - you don’t have to worry about picking a dimension that carries actual meaning as the axis of the plot.
  • You get an API to access the data and a predictive function built on top of it. That lets developers integrate this type of analysis easily into applications. 

Conclusion

In conclusion, what benefits can you get from applying embeddings in Exploratory Data Analysis (EDA)?

  1. Unified Representation: Embeddings transform structured data into a space where distances are inherently meaningful. This is particularly advantageous for datasets with mixed data types (numeric and non-numeric). Traditional EDA often requires manual transformation of non-numeric variables for anything beyond basic visualizations and queries.
  2. Learned Distance Metric and Features: Embeddings create a representation where related elements cluster naturally, because the training objective was to maximize mutual information between the dimensions (columns). The resulting learned distance metric significantly simplifies downstream analysis and predictive modeling. In contrast, traditional approaches require explicit feature extraction to identify relevant dimensions before proceeding with advanced EDA techniques like clustering.
Leveraging embeddings in your EDA process uncovers deeper insights than traditional approaches, using the power of AI. Also, representing structured data with embeddings facilitates building predictive models: you can obtain performant models in a single step, whereas traditional machine learning models typically require many iterations. This approach can be particularly valuable when dealing with complex, high-dimensional datasets where traditional EDA techniques might fall short - because it can be difficult to know what to look for. Finding the dimensions that show patterns in the data may require a lot of trial and error.

If you’re ready to work with the SDK and API, sign up for our free trial, follow us on LinkedIn, and share your feedback with hello@featrix.ai!