Batch evaluation
Score and compare LLMs on your entire dataset at once, using domain-specific metrics.
Batch evaluation generates evaluation metrics across your entire dataset for all selected models at once. Airtrain lets you configure each model and choose the metrics that matter to you.
The steps to run an evaluation job are the following:
- Upload your evaluation dataset
- Select models to compare and configure them
- Select and specify metrics
- Start the job
- Visualize metrics distributions and browse through individual inferences
Start an evaluation job
Click New Job in the navigation bar, then select Evaluation Job. A form appears with the sections described below.
Name your job
First, input a unique name for your job. This is the main identifier for your job. You can use any naming convention you like, for example eval-job-1-new-prompts.
Configure the data source
In the leftmost section of the form, select a data source type.
JSONL files
JSONL is a convention for storing datasets in JSON format. Each line in the file must be a valid JSON object. All lines should have the same schema. The keys of each row should consist only of alphanumeric characters and '_'.
Click the upload button to select and upload your JSONL file.
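As a rough sketch (not Airtrain code), a dataset in this format could be produced from Python as shown below; the column names question and expected_topic are purely illustrative:

```python
import json

# Illustrative rows: every row shares the same schema, and keys use only
# alphanumeric characters and '_'.
rows = [
    {"question": "What is the capital of France?", "expected_topic": "geography"},
    {"question": "Summarize the plot of Hamlet.", "expected_topic": "literature"},
]

with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")  # one valid JSON object per line
```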
CSV files
You can also provide your data as a CSV file. A header row is required. The names of the columns in the header row must consist only of alphanumeric characters and '_'.
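For example, an equivalent CSV upload (again with purely illustrative column names) could look like:

```
question,expected_topic
What is the capital of France?,geography
Summarize the plot of Hamlet.,literature
```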
Other data sources
At this time Airtrain only supports JSONL and CSV files as data sources. If you would like Airtrain to offer additional data sources (e.g. plain text files, PostgreSQL, Snowflake, etc.), reach out to us on Slack and let us know what you need.
Select models to compare
In the central section of the form, select models you want to compare.
For each model, provide a name, set the prompt, select the model variant, and set the temperature. You can use the name to record details about this particular model configuration, for example llama-2-7b-temp06-full-prompt.
Prompt templates
To specify the prompt, use a combination of raw text and template variables enclosed in double curly braces, for example:

Talk about the {{weather}} weather

Given an input JSONL/CSV file with a column named weather, such as:

weather,mood
rainy,sad
sunny,happy
hot,angry

this prompt template would render to strings like "Talk about the rainy weather" or "Talk about the sunny weather". This curly-brace template syntax can be used in many places in the Airtrain UI. Note that your prompt template MUST include at least one template variable. If you want to repeat the same prompt several times, you can prepare a JSONL file in which one field repeats the same prompt and then use that field in the prompt template. For example:
{{my_prompt}}
{"my_prompt": "Give me a random number between 1 and 10."}
{"my_prompt": "Give me a random number between 1 and 10."}
{"my_prompt": "Give me a random number between 1 and 10."}
...
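The rendering behavior can be pictured with a small sketch like the one below. This only illustrates {{variable}} substitution against each row and is not Airtrain's actual implementation; it assumes a hypothetical JSONL file whose rows contain a weather column:

```python
import json
import re

TEMPLATE = "Talk about the {{weather}} weather"

def render(template: str, row: dict) -> str:
    # Replace each {{variable}} with the value of the matching column in the row.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row[m.group(1)]), template)

with open("weather_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        print(render(TEMPLATE, json.loads(line)))  # e.g. "Talk about the rainy weather"
```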
Special characters in prompt templates/columns
Prompt template variables may not contain special characters, that is, anything other than alphanumeric characters and '_'. If your data has columns whose names contain special characters and you want to use them in your prompt template, you will need to rename those columns to remove the special characters or replace them with valid ones.
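If a dataset has offending column names, a one-off script along these lines (a sketch, not an Airtrain feature) can rewrite a JSONL file so that every key is valid:

```python
import json
import re

def sanitize(name: str) -> str:
    # Replace anything that is not alphanumeric or '_' with an underscore,
    # e.g. "user name" -> "user_name".
    return re.sub(r"[^0-9A-Za-z_]", "_", name)

with open("raw.jsonl", encoding="utf-8") as src, open("clean.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        row = json.loads(line)
        dst.write(json.dumps({sanitize(key): value for key, value in row.items()}) + "\n")
```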
Full chat support
You may notice that Airtrain's model configuration is structured around a single prompt/response pair rather than a full multi-turn chat between the user and the assistant. Support for full chats is on our roadmap; if you need it, please reach out on our Slack.
Supported Models
At this time Airtrain supports the following models:
Custom model
If you want to extract evaluation metrics for your own custom model, you can include the inferences to grade as part of the input dataset. Then select Custom Model in the central panel.
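For example, each row of the uploaded dataset could carry the already-generated response in a dedicated column; the column names below are only an assumption for illustration:

```
{"question": "What is 2 + 2?", "my_model_output": "4"}
{"question": "Name a primary color.", "my_model_output": "Blue"}
```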
Configure evaluation metrics
Airtrain offers four evaluation mechanisms:
- AI scoring – describe the properties you want to score in plain English and let our scoring model grade inferences.
- Unsupervised metrics – standard metrics such as length, compression, and density.
- JSON schema validation – validate the compliance of your generated JSON payloads (see the example schema below).
- Reference-based metrics – compare the inferences to an expected reference.
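JSON schema validation typically checks generated payloads against a schema describing their expected shape. A minimal illustrative schema (the fields are hypothetical, not a required format) might look like:

```json
{
  "type": "object",
  "properties": {
    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number"}
  },
  "required": ["sentiment"]
}
```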