AI scoring

Airtrain uses a dedicated scoring model to grade inferences of other models on a scale of 1 to 5.

In the "Evaluation methods" section of the New Job form, activate the "AI scoring" section and fill in the details.

Describe in plain English each property you want to evaluate and grade. You can create as many scoring metrics as you want.

Templating

The property description supports the same templating as model prompts. A field name in double curly brackets, such as {{value}}, is replaced with the contents of the value field from the input dataset.

Additionally, {%prompt} is replaced with the fully rendered input prompt.
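To make the interpolation concrete, here is a minimal Python sketch of how a templated description could be rendered against one dataset row. It is an illustration only, not Airtrain's implementation: the render_description function, the sample row, and the choice to raise an error for a missing field are all assumptions.

```python
import re

# Matches {{field}} placeholders; field names are assumed to be word characters.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_description(template: str, row: dict) -> str:
    """Replace each {{field}} with the matching value from a dataset row."""

    def substitute(match: re.Match) -> str:
        field = match.group(1)
        if field not in row:
            # Assumption: a missing field is an error rather than an empty string.
            raise KeyError(f"dataset row has no field named {field!r}")
        return str(row[field])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical dataset row and property description.
row = {"question": "What is 2 + 2?", "context": "Basic arithmetic."}
template = "Grade the answer to <question>{{question}}</question>."
print(render_description(template, row))
# prints: Grade the answer to <question>What is 2 + 2?</question>.
```

Each row of the input dataset would produce its own rendered description, so the scoring model sees the concrete values rather than the placeholders.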

Tips to maximize score quality

Here are some tips to get the best possible results out of our scoring model.

Unidimensional properties

Make sure to break down the properties you want to evaluate into their most elementary components. The more focused and straightforward the property, the easier it will be for the scoring model to reason about it.

For example, instead of defining a broad property as such

prefer breaking it down into three properties

Add a rubric

We have observed that using a rubric yields more accurate scores. Try to provide specific criteria for each score. You can design your own scale within the 1 to 5 range.

For example:

The score describes how well justified the response is based on the context in the prompt. The following is a grading rubric to use:

1. The answer is incorrect OR the reasoning behind the answer doesn't reference the context at all.
2. The answer is incorrect, OR it uses an external statistic, book, or other reference as a key element in its reasoning.
3. The answer may or may not be correct. It may make reference to external facts, but only if they are fairly common knowledge.
4. The answer is correct and was derived from the context using almost no external knowledge.
5. The answer is correct. Either the answer is directly in the context in a way that requires no derivation, or the derivation uses only rigorous reasoning like correctly applied math or spatial reasoning.

Now, for the response you are grading, the context and question are:
<context>{{context}}</context>
<question>{{question}}</question>
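
If you maintain several scoring metrics, a rubric in this shape can be assembled programmatically and then pasted into the property description. The following is a sketch under assumptions: build_rubric_description and its arguments are hypothetical helper names, not part of Airtrain, and the criteria in the usage example are placeholders.

```python
def build_rubric_description(summary: str, criteria: dict, fields: list) -> str:
    """Assemble a property description with one criterion per score from 1 to 5."""
    expected = [1, 2, 3, 4, 5]
    if sorted(criteria) != expected:
        raise ValueError("criteria must define exactly the scores 1 through 5")

    lines = [summary + " The following is a grading rubric to use:", ""]
    for score in expected:
        lines.append(f"{score}. {criteria[score]}")
    lines.append("")
    lines.append("Now, for the response you are grading, the inputs are:")
    for field in fields:
        # Emit a {{field}} placeholder wrapped in matching tags.
        lines.append("<" + field + ">{{" + field + "}}</" + field + ">")
    return "\n".join(lines)


# Placeholder criteria for illustration; write specific ones for a real metric.
criteria = {score: f"Criterion for a score of {score}." for score in range(1, 6)}
description = build_rubric_description(
    "The score describes how well justified the response is.",
    criteria,
    ["context", "question"],
)
print(description)
```

Requiring a criterion for every score from 1 to 5 keeps the rubric aligned with the range the scoring model uses.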