
Semantic auto-clustering

When you upload a dataset to Airtrain, it automatically derives useful insights. One of the most useful for discovering the content of your dataset is Airtrain's automatic semantic clustering.

Airtrain generates an embedding for each row in your dataset, applies an optimized clustering algorithm, and labels the discovered clusters. This process is run twice to establish a hierarchical structure of so-called "base clusters" (a few hundred granular clusters) and "meta clusters" (a few dozen clusters made of base clusters).
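To make the two-level structure concrete, here is a minimal sketch of the idea, not Airtrain's actual implementation. It assumes plain k-means as a stand-in for the unspecified "optimized clustering algorithm", uses made-up 2-D vectors in place of real embeddings, and skips the labeling step. The names `kmeans` and `two_level_clusters` are hypothetical.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (an assumed stand-in); returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Move each centroid to the mean of its members; leave it alone if empty.
        for c in range(k):
            members = [points[i] for i, a in enumerate(assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


def two_level_clusters(embeddings, n_base, n_meta):
    """Cluster rows into base clusters, then cluster the base centroids into meta clusters."""
    base_of_row, base_centroids = kmeans(embeddings, n_base)
    meta_of_base, _ = kmeans(base_centroids, n_meta)
    # Each row inherits the meta cluster of its base cluster.
    meta_of_row = [meta_of_base[b] for b in base_of_row]
    return base_of_row, meta_of_row


# Toy 2-D "embeddings" (made-up values standing in for real embedding vectors).
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 1.0), (0.1, 1.0),
    (9.0, 9.0), (9.1, 9.0), (9.0, 10.0), (9.1, 10.0),
]
base_of_row, meta_of_row = two_level_clusters(embeddings, n_base=4, n_meta=2)
for row, (b, m) in enumerate(zip(base_of_row, meta_of_row)):
    print(f"row {row}: base cluster {b}, meta cluster {m}")
```

Because the second pass clusters base-cluster centroids rather than rows, every base cluster belongs to exactly one meta cluster, which is what makes the result a hierarchy.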

Example on the MMLU-Pro dataset

Here is the data explorer for the MMLU-Pro benchmark dataset: https://app.airtrain.ai/dataset/290ba84d-da8b-4358-9cf4-9e51506faa80/null/1/0

Here are the semantic clusters automatically derived from this dataset.

The outer ring of the pie chart represents the base clusters and the inner ring represents the meta clusters.

Some examples of meta clusters are:

  • Biology Questions
  • Scientific Calculations
  • Legal & Moral Implications
  • Mathematical Problem Solving
  • etc.

Working with clusters

You can use these clusters to drill down into your dataset. To do so, select the cluster you want to look into, then click the "Add filter" button.


You can then browse the rows that belong to this cluster. You can combine these filters with any other insight filters.
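Conceptually, a cluster filter combined with another insight filter is a logical AND over row attributes. The sketch below is a local illustration of that behavior, not an Airtrain API. The field names (`meta_cluster`, `base_cluster`, `language`), the base cluster names, and the `apply_filters` helper are all hypothetical; only the meta cluster names come from the MMLU-Pro example above.

```python
# Made-up rows: each carries its cluster labels plus one other insight ("language").
rows = [
    {"id": 1, "meta_cluster": "Biology Questions", "base_cluster": "Genetics", "language": "en"},
    {"id": 2, "meta_cluster": "Biology Questions", "base_cluster": "Cell Biology", "language": "fr"},
    {"id": 3, "meta_cluster": "Scientific Calculations", "base_cluster": "Thermodynamics", "language": "en"},
]


def apply_filters(rows, **filters):
    """Keep rows whose fields equal every given filter value (logical AND)."""
    return [r for r in rows if all(r.get(f) == v for f, v in filters.items())]


# Filter on a single meta cluster.
biology = apply_filters(rows, meta_cluster="Biology Questions")
# Combine the cluster filter with another (hypothetical) insight filter.
biology_en = apply_filters(rows, meta_cluster="Biology Questions", language="en")

print([r["id"] for r in biology])
print([r["id"] for r in biology_en])
```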