MMLU Benchmark

Replicating MMLU benchmark results with Airtrain.

In this example, we will reproduce official MMLU benchmark results for the Llama 2 family of models using Airtrain's AI scoring model.

Llama 2 correctness as measured by Airtrain's scoring model.


The MMLU benchmark

MMLU is one of the most popular benchmarks for Large Language Models. It features general-knowledge questions spanning dozens of topics, from Anatomy to International Law to College-level Mathematics.

Here is an example question:

Topic: High school chemistry
Question: Chlorine gas reacts most readily with:
A. toluene
B. ethylene
C. ethanoic acid
D. ethane

The MMLU test dataset

We collated all questions from the test split of the MMLU dataset hosted by HuggingFace into a JSONL file.

You can download it here.

The row schema is as follows:

{
  "topic": "high school chemistry",
  "question": "Chlorine gas reacts most readily with:",
  "answer_a": "toluene",
  "answer_b": "ethylene",
  "answer_c": "ethanoic acid",
  "answer_d": "ethane",
  "correct_answer": "B"
}
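Rows in this shape can be produced from the HuggingFace records with a small script. The sketch below assumes the HuggingFace field layout ("subject", "question", "choices", and an integer "answer" index); adjust the field names if your copy of the dataset differs.

```python
import json

LETTERS = ["A", "B", "C", "D"]

def to_row(record):
    """Map one MMLU record to the flat JSONL schema used here."""
    choices = record["choices"]
    return {
        "topic": record["subject"].replace("_", " "),
        "question": record["question"],
        "answer_a": choices[0],
        "answer_b": choices[1],
        "answer_c": choices[2],
        "answer_d": choices[3],
        # The source stores the answer as an index; convert it to a letter.
        "correct_answer": LETTERS[record["answer"]],
    }

# Example record in the HuggingFace layout:
record = {
    "subject": "high_school_chemistry",
    "question": "Chlorine gas reacts most readily with:",
    "choices": ["toluene", "ethylene", "ethanoic acid", "ethane"],
    "answer": 1,
}
print(json.dumps(to_row(record)))
```

Writing one such JSON object per line yields the JSONL file.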

Uploading the file

In the top menu bar, click "New job".

Then select "JSONL file upload" in the Source type dropdown. Click "Choose file" and select your mmlu.jsonl file.

Configure the models

In the central panel, click the + button next to the model you want to configure.

Name your configuration, for example "Llama 2 7B". Select the 7B variant, set the temperature to 0.1, and paste the following prompt:

Here is a question on the topic of {{topic}}.

Question: {{question}}

Which of the following answers is correct?

A. {{answer_a}}
B. {{answer_b}}
C. {{answer_c}}
D. {{answer_d}}

State the letter corresponding to the correct answer.
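Airtrain fills the {{...}} placeholders above from each dataset row. For reference, the substitution amounts to the following (a hypothetical helper, not Airtrain's API):

```python
import re

def render(template, row):
    """Replace each {{name}} placeholder with the matching row value."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row[m.group(1)]), template)

template = "Question: {{question}}\nA. {{answer_a}}"
row = {
    "question": "Chlorine gas reacts most readily with:",
    "answer_a": "toluene",
}
print(render(template, row))
```

Because the placeholder names match the JSONL schema, every field of a row is available in the prompt.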

Then, configure as many other models and variants as you want, for example Llama 2 13B and 70B.

Evaluation metrics


Model performance on the MMLU benchmark is measured as a pass rate: the fraction of questions the model answers correctly.

To replicate this with Airtrain, we will create a Correctness property with the following description:

This score describes whether the chatbot selected the correct answer.
The correct answer is {{correct_answer}}.

Here is a scoring rubric to use:
1. The chatbot's answer is not {{correct_answer}}, therefore the chatbot is incorrect.
5. The chatbot's answer is {{correct_answer}}, therefore the chatbot is correct.

Airtrain's scoring model grades inferences on a Likert scale of 1 to 5. In this case, we want to measure a binary pass/fail rate, so we will use only two scores, e.g. 1 (fail) and 5 (pass) as shown above.

We can interpolate the property description with the correct answer that is provided in the input dataset.
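Since only the two extreme scores are used, turning the Likert scores back into a pass rate is a simple aggregation. A minimal sketch, assuming the scores are available as a list of integers:

```python
def pass_rate(scores, passing_score=5):
    """Fraction of inferences graded at the passing score."""
    return sum(s == passing_score for s in scores) / len(scores)

# Hypothetical scores for five questions: three passes, two fails.
scores = [5, 1, 5, 5, 1]
print(pass_rate(scores))  # 0.6
```

With the two-point rubric above, this fraction is directly comparable to an MMLU accuracy number.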


Out of curiosity, we also activate the Length unsupervised metric to get a sense of which variant is more verbose.
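The Length metric is computed by Airtrain; a comparable number can be derived locally as the average response length per variant (the responses below are made up for illustration):

```python
def average_length(responses):
    """Mean character count across a variant's responses."""
    return sum(len(r) for r in responses) / len(responses)

# Hypothetical responses from two variants.
by_variant = {
    "Llama 2 7B": ["The correct answer is B, ethylene, because...", "B. ethylene"],
    "Llama 2 13B": ["B", "B"],
}
for variant, responses in by_variant.items():
    print(variant, average_length(responses))
```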

Evaluation results

View the public results page here.


On this plot we can read off the following pass rates (score of 5) and compare them with the official MMLU benchmark results listed here.

Llama 2 variant | Airtrain MMLU correctness rate | Official MMLU correctness rate

We can see that Airtrain's scoring model comes close to the official MMLU benchmark results.

As expected, we also note that higher correctness correlates with larger model size.


On this plot we can see that the 7B variant is more verbose than the 13B and 70B variants; 13B is the most concise.