
Airtrain SDK

This page holds documentation for the Airtrain SDK.


To install the core package without any integrations, run:

pip install airtrain-py

You may install integrations by using pip extras. As an example, to install the pandas integration:

pip install airtrain-py[pandas]

If you want to install all integrations, you may do the following:

pip install airtrain-py[all]

The following are available extras:

  • pandas
  • polars
  • llama-index


Obtain your API key by going to your Airtrain user settings.

Then you may upload a new dataset as follows:

import airtrain as at

# Can also be set with the environment variable AIRTRAIN_API_KEY
at.set_api_key("YOUR_API_KEY")  # placeholder; substitute your own key

url = at.upload_from_dicts(
    [
        {"foo": "some text", "bar": "more text"},
        {"foo": "even more text", "bar": "so much text"},
    ],
    name="My Dataset name",  # name is optional
).url

# You may view your dataset in the Airtrain dashboard at this URL
# It may take some time to complete ingestion and generation of
# automated insights. You will receive an email when it is complete.

print(f"Dataset URL: {url}")
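The comment in the snippet above notes that the key may come from the AIRTRAIN_API_KEY environment variable. A minimal sketch of failing early when it is unset follows; require_api_key is a hypothetical helper written for this page, not part of the Airtrain SDK.

```python
import os


def require_api_key(env=os.environ):
    """Return the API key found in AIRTRAIN_API_KEY, or raise a clear error.

    Hypothetical helper (not an Airtrain SDK function). `env` is any
    mapping, which makes the check easy to exercise without touching the
    real process environment.
    """
    key = env.get("AIRTRAIN_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "AIRTRAIN_API_KEY is not set; export it before uploading a dataset."
        )
    return key
```

Calling require_api_key() at the top of a script surfaces a missing key before any data is prepared for upload.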

The data may be any iterable of dictionaries that can be represented using automatically inferred Apache Arrow types. If you would like to give a hint as to the Arrow schema of the data being uploaded, you may provide one using the schema parameter to upload_from_dicts.
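Because any iterable of dictionaries is accepted, the rows do not need to be held in a list. As a sketch under that assumption, a generator can stream rows from a JSON Lines source; iter_jsonl_rows is a hypothetical helper, and the upload call is left commented out because it needs a valid API key.

```python
import io
import json


def iter_jsonl_rows(lines):
    """Yield one dict per non-empty JSON Lines record.

    Hypothetical helper (not part of the Airtrain SDK). `lines` is any
    iterable of strings, for example an open text file.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a file on disk.
sample = io.StringIO(
    '{"foo": "some text", "bar": "more text"}\n'
    '{"foo": "even more text", "bar": "so much text"}\n'
)
rows = iter_jsonl_rows(sample)

# Assumed usage, mirroring the upload example above:
# url = at.upload_from_dicts(rows, name="My Dataset name").url
```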

Custom Embeddings

Airtrain produces a variety of insights into your data automatically. Some of these insights (e.g. automatic clustering) rely on embeddings of the data. Airtrain will embed your data automatically, but if you wish to provide your own embeddings you may do so by adding the embedding_column parameter when you upload:

url = at.upload_from_dicts(
    [
        {"foo": "some text", "bar": [0.0, 0.707, 0.707, 0.0]},
        {"foo": "even more text", "bar": [0.577, 0.577, 0.0, 0.577]},
    ],
    embedding_column="bar",
).url

If you provide this argument, the embeddings must all be lists of floating point numbers with the same length.
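The constraint above can be checked locally before uploading. The following is a minimal sketch, not SDK behavior: check_embeddings is a hypothetical helper, and it is deliberately strict in that it rejects integers as well as non-list values.

```python
def check_embeddings(rows, column):
    """Return the shared embedding length, or raise ValueError.

    Hypothetical pre-upload check (not part of the Airtrain SDK). Every
    value in `column` must be a list of floats, and all lists must have
    the same length. Returns None when `rows` is empty.
    """
    expected_len = None
    for i, row in enumerate(rows):
        vec = row[column]
        if not isinstance(vec, list):
            raise ValueError(f"row {i}: {column!r} must be a list")
        if not all(isinstance(x, float) for x in vec):
            raise ValueError(f"row {i}: {column!r} must contain only floats")
        if expected_len is None:
            expected_len = len(vec)
        elif len(vec) != expected_len:
            raise ValueError(
                f"row {i}: expected length {expected_len}, got {len(vec)}"
            )
    return expected_len


rows = [
    {"foo": "some text", "bar": [0.0, 0.707, 0.707, 0.0]},
    {"foo": "even more text", "bar": [0.577, 0.577, 0.0, 0.577]},
]
dim = check_embeddings(rows, "bar")
```

Running a check like this first turns a malformed row into an immediate local error that names the offending row index.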


Airtrain provides integrations for uploading data from a variety of sources. In general, most integrations take the form of an upload_from_x(...) function whose signature matches that of upload_from_dicts, except that the first parameter specifies the data to be uploaded. Integrations may require installing the Airtrain SDK with extras.


import pandas as pd

# ...

df = pd.DataFrame(
    {
        "foo": ["some text", "more text", "even more"],
        "bar": [1, 2, 3],
    }
)

url = at.upload_from_pandas(df, name="My Pandas Dataset").url

You may also provide an iterable of pandas dataframes instead of a single one.


import polars as pl

# ...

df = pl.DataFrame(
    {
        "foo": ["some text", "more text", "even more"],
        "bar": [1, 2, 3],
    }
)

url = at.upload_from_polars(df, name="My Polars Dataset").url

You may also provide an iterable of polars dataframes instead of a single one.


import pyarrow as pa

# ...

table = pa.table({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})

url = at.upload_from_arrow_tables([table], name="My Arrow Dataset").url


Note that these examples also involve installing additional Llama Index integrations. A more detailed example of using Airtrain + Llama Index can be found in the Llama Index docs.

from llama_index.readers.github import GithubRepositoryReader, GithubClient
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

import airtrain as at

# Data does not have to come from GitHub; this is for illustrative purposes.

github_client = GithubClient(...)  
documents = GithubRepositoryReader(...).load_data(branch=branch)

# You can upload documents directly. In this case Airtrain will generate embeddings.
result = at.upload_from_llama_nodes(
    documents,
    name="My Document Dataset",
)
print(f"Uploaded {result.size} rows to {result.name}. View at: {result.url}")

# Or you can chunk and/or embed it first. Airtrain will use the embeddings
# you created via LlamaIndex.
embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(...)
nodes = splitter.get_nodes_from_documents(documents)
result = at.upload_from_llama_nodes(
    nodes,
    name="My embedded RAG Dataset",
)
print(f"Uploaded {result.size} rows to {result.name}. View at: {result.url}")

Alternatively, using the "Workflows" API:

import asyncio

from llama_index.core.schema import Node
from llama_index.core.workflow import (
    Context,
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)
from llama_index.readers.web import AsyncWebPageReader

from airtrain import DatasetMetadata, upload_from_llama_nodes

URLS = [
    # ... (the page URLs to ingest go here)
]


class CompletedDocumentRetrievalEvent(Event):
    name: str
    documents: list[Node]


class AirtrainDocumentDatasetEvent(Event):
    metadata: DatasetMetadata


class IngestToAirtrainWorkflow(Workflow):
    @step
    async def ingest_documents(
        self, ctx: Context, ev: StartEvent
    ) -> CompletedDocumentRetrievalEvent | None:
        # Nothing to do if no URLs were passed to the workflow.
        if not ev.get("urls"):
            return None
        reader = AsyncWebPageReader(html_to_text=True)
        documents = await reader.aload_data(urls=ev.get("urls"))
        return CompletedDocumentRetrievalEvent(name=ev.get("name"), documents=documents)

    @step
    async def ingest_documents_to_airtrain(
        self, ctx: Context, ev: CompletedDocumentRetrievalEvent
    ) -> AirtrainDocumentDatasetEvent | None:
        if not isinstance(ev, CompletedDocumentRetrievalEvent):
            return None

        # Upload the retrieved documents under the name given at workflow start.
        dataset_meta = upload_from_llama_nodes(ev.documents, name=ev.name)
        return AirtrainDocumentDatasetEvent(metadata=dataset_meta)

    @step
    async def complete_workflow(
        self, ctx: Context, ev: AirtrainDocumentDatasetEvent
    ) -> None | StopEvent:
        if not isinstance(ev, AirtrainDocumentDatasetEvent):
            return None
        return StopEvent(result=ev.metadata)


async def main() -> None:
    workflow = IngestToAirtrainWorkflow()
    result = await workflow.run(
        name="My HN Discussions Dataset", urls=URLS,
    )
    print(f"Uploaded {result.size} rows to {result.name}. View at: {result.url}")


if __name__ == "__main__":
    asyncio.run(main())