Human-Aligned Text-to-SQL Evaluation

In my last post about Text-to-SQL using GPT-3.5, I pointed out the issue with existing benchmark’s evaluation metric: it rejects perfectly fine SQL queries for not having the same strict execution result as golden (aka. labeled) queries. In this post, I discuss a new metric I created to better evaluate Text-to-SQL performance.

Here is an example from my last post:

-- (Hard) 
-- Question: Which grade has the most high schoolers?
-- pred:
SELECT grade, COUNT(*) AS num_highschoolers
FROM Highschooler
GROUP BY grade
ORDER BY num_highschoolers DESC
-- gold:
SELECT grade
FROM Highschooler
GROUP BY grade
ORDER BY count(*) DESC

In this example, the generated query (top) is marked as an error. Comparing to the golden query (bottom), the generated query has an extra computed column for student counts. In my opinion the query generated by GPT-3.5 returns more informative answer to the original question.

To recap the numbers from my last post: I tested GPT-3.5 on 96 questions from the Spider benchmark (86 from the training set and 10 from the development set) and found that it scored 41.7% in execution accuracy and 38.5% in exact match accuracy. However, after further analysis of the error cases, it became clear to me that most of the queries generated by GPT-3.5 were actually perfectly fine from a human perspective. This highlights the problem with our existing evaluation metrics for Text-to-SQL, as they do not accurately measure human preference.

This issue of mis-alignment between benchmark evaluation metrics and actual human preference is well-known in the research community. For instance, in document summarization benchmarks, the standard evaluation metric is ROUGE, which is based on counting overlapping word phrases between a generated summary and a golden summary. This approach fails to detect correct summaries written in different terms and phrases from the golden summary. Despite this, ROUGE scores are still being used by recent papers.

Human evaluation remains the gold standard in ML research, and automatic metrics are the “unit tests” for new models. Is it possible to have something in between, like a “staging environment” for new models?

GPT-3.5 for Evaluation

GPT-3.5 has been fine-tuned to align with human preference, making it much better at following instructions than GPT-3, which requires careful prompt engineering to work effectively. This got me thinking: can we leverage GPT-3.5 to create an evaluation metric for Text-to-SQL that aligns with human preference? Let’s break it down step-by-step:

First, we need to consider what GPT-3.5 is capable of doing:

  • It can transform structured data to text, and vice versa.
  • It can reason based on facts provided directly in the prompt context.

What does a Text-to-SQL task entail?

  • An input question in natural language text.
  • Generating a SQL query.
  • Executing the SQL query and obtaining structured data (i.e., rows).

To evaluate a generated SQL query, we can start by asking GPT-3.5 to answer the input question using the structured data result obtained from executing the query. This will give us a natural language answer to the input question. We repeat this step for the golden SQL query. Then, we ask GPT-3.5 whether the hypothesis answer generated from the generated query is correct, given the input question and the reference answer generated from the golden query’s result. We receive a binary Yes or No response from GPT-3.5.

The diagram below illustrates this design:

Evaluating Text-to-SQL task using GPT-3.5
Evaluate Text-to-SQL using GPT-3.5. The actual prompts can be found in this script.

By using GPT-3.5 in this way, we can create an evaluation metric for Text-to-SQL that better aligns with human preference.

Experimental Results

I extended the Spider benchmark’s evaluation with this new metric, which I named Answer Accuracy, as part of the LlamaIndex open source project. The code I used can be found here.

Using this new metric, I evaluated the generated SQL queries of the same 96 questions. The overall answer accuracy for GPT-3.5 text-davinci-003 is an impressive 88.54%. In fact, upon reviewing the error cases listed in my last blog post, I found that the cases that I pointed out to be correct were indeed marked as such, and the one true error case was correctly caught

To further validate this new metric, I tested it on outputs from other GPT models, including GPT-4. Here is the a plot of their answer accuracies:

Answer accuracies of GPT models (zero-shot) on 96 questions from Spider benchmark: code-davinci-002, 0.7917; text-davinci-003, 0.8854; gpt-3.5-turbo, 0.8542; gpt-4, 0.8958.
Answer accuracies of GPT models (zero-shot) on 96 questions from Spider benchmark

The resulting answer accuracies are as follows: code-davinci-002 at 79.17%, text-davinci-003 at 88.54%, gpt-3.5-turbo at 85.42%, and gpt-4 at 89.58%. As expected, GPT-4 outperformed the other models, which aligns with our practical experience of these models’ relative performance.

In contrast, the execution accuracies and match accuracies of GPT models do not align with our practical experience. Here is a plot showing these metrics used on GPT models.

Execution and exact match accuracies of GPT models on 96 questions sampled from Spider benchmark

If we use these metrics as the primary measurement of Text-to-SQL performance, we would think gpt-4 is less powerful than code-davinci-002. These results clearly do not match our practical experience with these models.

Key Implications

What are the key implications of this Text-to-SQL experiment?

Firstly, we found that the existing automatic metrics, namely execution accuracy and exact match accuracy, are mis-aligned with human preference. As a result, they should not be relied upon as the primary way to evaluate Text-to-SQL models.

Second, while human evaluation remains the gold standard, large language models (LLMs) such as GPT-3.5 and above can accurately approximate human preference. This is a potential game changer, because they allow us to evaluate Text-to-SQL models and estimate the expected human evaluation result. This is especially useful when human evaluation is not feasible due to budget limitation.

Thirdly, and this is where things get interesting, our use of instruction-tuned LLMs for evaluation could be generalized to many other tasks. This approach is especially useful for tasks whose output consists of simple statements that can be directly reasoned over by LLMs. And there’s a research paper on this very topic called “GPTScore”, which trains GPT-3.5 to evaluate NLP tasks using in-context learning.


This post is intended to provoke a conversation about Text-to-SQL evaluation, not to claim the answer accuracy metric is 100% human aligned or even robust. I am happy to collaborate if you would like to take a step further toward a better evaluation.



Researcher in Data Systems working at Microsoft Research. Views published here are not endorsed by my employer.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Eric Zhù

Researcher in Data Systems working at Microsoft Research. Views published here are not endorsed by my employer.