What is Coming Next for Text-to-SQL

Eric Zhù
10 min read · Mar 7, 2023

Text-to-SQL is a natural language processing (NLP) task that involves converting natural language questions into SQL queries that can be executed on a database.

Here is a Text-to-SQL example on a TV show database:

Question: What is the content of TV Channel with serial name “Sky Radio”?

Model: SELECT Content FROM TV_Channel WHERE series_name = 'Sky Radio';

Result: music.

The main motivation for Text-to-SQL is that a natural language interface lets users without technical expertise interact with databases without having to write SQL. Wouldn’t you like to be able to speak to your relational database in English?

In this post, I will review the current state-of-the-art Text-to-SQL systems, experiment with the latest OpenAI GPT model for Text-to-SQL, and finally lay out my thoughts about what is coming next.

The Current State

Text-to-SQL systems are currently evaluated using the Spider benchmark, which includes 10k questions with labeled SQL queries.

The state-of-the-art approaches that score highest on this benchmark typically use a simple framework: they take an existing language model (such as BERT or T5), which has been pre-trained on natural language text, and then perform supervised training on labeled SQL generation examples. This process is called “fine-tuning”. Some approaches build on top of this framework by adding a pre-processing stage to reduce the difficulty of generation, or by adding a post-processing stage to improve the correctness of generated SQLs.
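To make this framework concrete, here is a minimal sketch of the fine-tuning step, assuming HuggingFace Transformers and T5 with a toy question/SQL pair; real leaderboard systems add schema serialization schemes, larger models, and the pre/post-processing stages mentioned above.

# A minimal sketch of the pre-train + fine-tune framework, using
# HuggingFace Transformers with T5. Illustrative only; leaderboard
# systems add schema linking and pre/post-processing on top of this.
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One labeled Spider-style example: question + serialized schema -> SQL.
source = ("translate to SQL: What is the content of TV Channel with "
          "serial name 'Sky Radio'? | schema: TV_Channel(id, series_name, content)")
target = "SELECT Content FROM TV_Channel WHERE series_name = 'Sky Radio'"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Supervised fine-tuning: one gradient step on the labeled pair.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()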

The main issue with this framework (a pre-trained language model + fine-tuning) is cost. Fine-tuning produces a task-specific model that you must host separately from the pre-trained model, solely for Text-to-SQL generation. Additionally, there is the compute cost of fine-tuning, which must be repeated to keep up with new examples. As language models get larger, these costs will only increase.

A New Wave — How good is GPT-3.5?

OpenAI’s GPT-3.5 models, specifically the davinci series and ChatGPT, have impressed us with their mastery of languages and ability to write code. They can even write SQL queries without any fine-tuning: just give them a schema and a question, and they will write SQL for you. With proficiency in over 20 human languages and many programming languages, this ability is not entirely surprising. But how good is it really? To evaluate its performance, I conducted some experiments.

Setup

For this, I used LlamaIndex (version 0.4.19), which has an easy-to-use API that utilizes OpenAI’s GPT models for querying structured and unstructured data sources, including text documents, knowledge bases, and relational databases, using natural language. I connected it with the Spider benchmark’s databases and ran the evaluation using the latest GPT-3.5 model, text-davinci-003. If you’re interested, you can find the end-to-end benchmark scripts here.
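Under the hood, the zero-shot flow is simple: serialize the database schema, append the question, and ask the model to complete the SQL. Here is a rough sketch of that flow against the OpenAI completions API of the time (my own illustration, not LlamaIndex’s actual internals):

# A rough sketch of zero-shot Text-to-SQL with text-davinci-003.
# My own illustration, not LlamaIndex's actual internals.
import openai  # openai-python 0.x, current as of this writing

schema = "CREATE TABLE TV_Channel (id TEXT, series_name TEXT, content TEXT);"
question = "What is the content of TV Channel with serial name 'Sky Radio'?"

prompt = (
    "Given the database schema below, write a SQLite query that answers "
    f"the question.\n\nSchema:\n{schema}\n\nQuestion: {question}\nQuery:"
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0,   # deterministic output for benchmarking
    max_tokens=256,
)
print(response["choices"][0]["text"].strip())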

In the Spider benchmark, execution accuracy measures the fraction of examples in which the predicted SQL query’s execution output matches that of the golden (i.e., human-labeled) SQL query. Exact matching accuracy measures the fraction of examples in which the predicted SQL matches the golden SQL syntactically. Exact matching accuracy is the less robust measure, because it can flag two semantically equivalent SQL queries as a mismatch. See the Spider paper for details.
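As a rough illustration (my own sketch, not the official Spider evaluator, which handles ordering and value matching more carefully), execution accuracy boils down to running both queries and comparing their outputs:

# A rough illustration of execution accuracy, not the official
# Spider evaluator: run both queries and compare their outputs.
import sqlite3

def execution_match(db_path: str, predicted_sql: str, golden_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(golden_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    finally:
        conn.close()
    # Compare as multisets: row order is irrelevant without ORDER BY.
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Execution accuracy is then the number of matches divided by the
# total number of examples in the benchmark.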

What I found

On the training examples (GPT-3.5 is used without fine-tuning, so these are all blind runs), it achieved 40.7% execution accuracy. On the development (i.e., validation) examples, it achieved 50% execution accuracy.

Refer to the tables below for a summary of the results. To keep costs low for this blog post, I randomly sampled 1% of all questions in the benchmark. As a result, there are 86 questions from training examples and 10 questions from development examples.

Results on Training Examples

+------------------------+-------+--------+-------+-------+------+
| | easy | medium | hard | extra | all |
+------------------------+-------+--------+-------+-------+------+
| count | 24 | 28 | 17 | 17 | 86 |
| EXECUTION ACCURACY | 0.833 | 0.357 | 0.176 | 0.118 | 0.407|
| EXACT MATCHING ACCURACY| 0.833 | 0.321 | 0.118 | 0.059 | 0.372|
+------------------------+-------+--------+-------+-------+------+

Results on Development Examples

+------------------------+-------+--------+------+-------+------+
| | easy | medium | hard | extra | all |
+------------------------+-------+--------+------+-------+------+
| count | 3 | 3 | 3 | 1 | 10 |
| EXECUTION ACCURACY | 1.000 | 0.333 | 0.000| 1.000 | 0.500|
| EXACT MATCHING ACCURACY| 1.000 | 0.333 | 0.000| 1.000 | 0.500|
+------------------------+-------+--------+------+-------+------+

In comparison, the current №1 model on the Spider leaderboard achieves 84.1% execution accuracy and 80.5% exact matching accuracy on the complete development set. So at first glance there is a huge gap between the zero-shot, blind-run performance of the best-in-class GPT-3.5 model (50%, 50%) and the best fine-tuned Text-to-SQL model (84.1%, 80.5%).

Error Analysis

Through past experimentation with GPT-3.5 models including ChatGPT, I have found that they can generate very complex queries and suggest rewrites with better execution performance with just a few hints. It seems to me that these models really understand the SQL language. So the accuracy number from the Spider benchmark is surprising. After all, while baseline models can only perform Text-to-SQL, GPT-3.5 models can also write Python!

So I took a closer look at the error cases in the development examples. What I realized is that the Spider benchmark’s accuracy measure is too strict: many generated SQLs are semantically correct to a human reader, but are still marked as errors. Here are the 5 errors in the development examples.

Error Example 1.

-- (Medium) 
-- Question: List the cost of each treatment and the corresponding treatment
-- type description.
-- pred:
SELECT Treatments.cost_of_treatment, Treatment_Types.treatment_type_description
FROM Treatments
INNER JOIN Treatment_Types
ON Treatments.treatment_type_code = Treatment_Types.treatment_type_code;
-- gold:
SELECT T1.cost_of_treatment, T2.treatment_type_description
FROM Treatments AS T1
JOIN treatment_types AS T2
ON T1.treatment_type_code = T2.treatment_type_code;

The predicted SQL query is effectively identical to the golden SQL in this case, apart from different table aliases and the explicit INNER JOIN keyword, neither of which matters here.

Error Example 2.

-- (Medium)
-- Question: What is the TV Channel that shows the cartoon
-- "The Rise of the Blue Beetle!"? List the TV Channel's series name.
-- pred:
SELECT TV_Channel.series_name
FROM Cartoon
INNER JOIN TV_series ON Cartoon.id = TV_series.id
INNER JOIN TV_Channel ON TV_series.Channel = TV_Channel.id
WHERE Cartoon.Title = 'The Rise of the Blue Beetle!';
-- gold:
SELECT T1.series_name
FROM TV_Channel AS T1
JOIN Cartoon AS T2 ON T1.id = T2.Channel
WHERE T2.Title = "The Rise of the Blue Beetle!";

GPT seems to have produced an unnecessary JOIN with TV_series, probably due to the mention of “series” in the question. The extra join did not change the number of results, though.

Error Example 3.

-- (Hard) 
-- Question: Which grade has the most high schoolers?
-- pred:
SELECT grade, COUNT(*) AS num_highschoolers
FROM Highschooler
GROUP BY grade
ORDER BY num_highschoolers DESC
LIMIT 1;
-- gold:
SELECT grade
FROM Highschooler
GROUP BY grade
ORDER BY count(*) DESC
LIMIT 1;

GPT added an extra column for count, which I actually think is better than the golden query.

Error Example 4.

-- (Hard) 
-- Question: Find the name of the winner who has the highest rank points
-- and participated in the Australian Open tourney.
-- pred:
SELECT winner_name
FROM matches
INNER JOIN rankings ON matches.winner_id = rankings.player_id
INNER JOIN players ON matches.winner_id = players.player_id
WHERE tourney_name = 'Australian Open'
AND rankings.ranking_points = (SELECT MAX(ranking_points) FROM rankings)
ORDER BY rankings.ranking_points DESC;
-- gold:
SELECT winner_name
FROM matches
WHERE tourney_name = 'Australian Open'
ORDER BY winner_rank_points DESC
LIMIT 1;

The question is ambiguous: “highest rank points” and “participated in the Australian Open tourney” read as two independent conditions, so I wouldn’t fault GPT for this one. Perhaps it should have used LIMIT 1; in practice, however, the number of returned rows doesn’t matter much in my opinion. In fact, the results are all “Serena Williams”.

There is only one objective failure case.

Error Example 5.

-- (Hard) 
-- Question: What are the names of the dogs for which the owner has not spend more than 1000 for treatment ?
-- pred:
SELECT D.name
FROM Dogs D
INNER JOIN Owners O ON D.owner_id = O.owner_id
INNER JOIN Treatments T ON D.dog_id = T.dog_id
WHERE T.cost_of_treatment <= 1000;
-- gold:
SELECT name
FROM dogs
WHERE dog_id NOT IN (
SELECT dog_id
FROM treatments
GROUP BY dog_id
HAVING SUM(cost_of_treatment) > 1000
);

In this case, GPT filtered on the cost of individual treatments instead of using a SUM aggregate over each dog’s total treatment cost. (The INNER JOIN also excludes dogs with no treatments at all, which the golden query includes.) This is a genuine error.

My Takeaways

If the results were evaluated by a human such as myself, the first four supposedly erroneous cases would have been marked as correct, bringing the accuracy to 90%. That still falls short of 100%, so I will not yet trust GPT to write my SQL queries unsupervised, but it can be a useful assistant for generating a draft or completing a template that I have laid out.

What’s Next — In-Context Learning

GPT-3.5 models have demonstrated that they can perform specific tasks when simply shown examples. In fact, months before their release, researchers were already discussing this way of teaching machine learning models task-specific abilities, known as “in-context learning” or “few-shot learning.”

To teach a GPT model a task using in-context learning, you simply provide some instruction and a few informative input-output examples within the prompt, and the model learns to perform the task from the examples provided. Note that this method does not involve fine-tuning the model, which means a generic GPT model can be used without requiring expensive hardware. Additionally, the latest GPT model, text-davinci-003, can handle up to 4,000 tokens, which is enough room for a few input-output examples even with complex SQL queries.

Here is an example I created to demonstrate in-context learning. The following is a prompt to OpenAI’s text-davinci-003 model.

Given a question, generate a SQL query in the dialect of PostgreSQL. 
Use the following format:

Schema: "a create table statement here"
Question: "question here"
Query: "the SQL query here"

Here are some examples.
The comments between /* and */ are for demonstration purpose.

Example 1
Schema: create table orders (
tstamp timestamp,
volume float
);
Question: generate a time series of order volumes.
Query: select ts_aggregate(tstamp, volume) as ts from orders;
/* A table with values:
|| tstamp || volume ||
| 2021-03-23 | 32.1 |
| 2021-04-21 | 55.2 |

Result of the query
|| ts ||
| { (2021-03-23, 32.1), (2021-04-21, 55.2) } |
*/

Schema: create table rides (
tstamp timestamp,
distance float,
vehicle_id int
);
Question: generate a time series of distance traveled for each vehicle.
Query:

In the above prompt, I begin by outlining the task at hand, which is Text-to-SQL conversion. I then provide a clear instruction for the model to follow a specific communication protocol, which enables me to programmatically parse the output if necessary. Additionally, through a single example, I introduce a new user-defined aggregate function called ts_aggregate, which is not part of Postgres. This function outputs a custom time series data type, which I also made up.

Subsequently, I pose a slightly different question that requires the use of a GROUP BY construct. The purpose of this question is to test the model's understanding of ts_aggregate as an aggregation function and its ability to work alongside GROUP BY.

The following is the output of text-davinci-003. It correctly produced the GROUP BY SQL. You can try it out yourself.

select vehicle_id, ts_aggregate(tstamp, distance) as ts from rides group by vehicle_id;
/* A table with values:
|| tstamp || distance || vehicle_id ||
| 2021-03-23 | 3.2 | 1 |
| 2021-04-21 | 5.5 | 1 |
| 2021-03-30 | 2.4 | 2 |

Result of the query
|| vehicle_id || ts ||
| 1 | { (2021-03-23, 3.2), (2021-04-21, 5.5) } |
| 2 | { (2021-03-30, 2.4) } |
*/

From this example, we can see that the model has learned and understood the usage of ts_aggregate, which was not specified in any other way in the prompt. Instead of relying solely on input-output examples, it is also possible to specify ts_aggregate using its functional signature, documentation, or implementation.
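If you want to try this yourself, a minimal call looks like the following (assuming the completions API of the time, the full prompt above saved to a file, and an OPENAI_API_KEY set in the environment):

# Send the in-context learning prompt above to text-davinci-003.
# Assumes openai-python 0.x and OPENAI_API_KEY in the environment.
import openai

with open("prompt.txt") as f:  # the full prompt shown above
    prompt = f.read()

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0,
    max_tokens=512,
)
print(response["choices"][0]["text"])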

Why is this important?

As mentioned previously, “traditional” Text-to-SQL models have relied on a framework of pre-training followed by fine-tuning. This approach is inherently limiting because the resulting models are single-use and require dedicated hardware. For example, if a Text-to-SQL service is set up on fixed hardware, it may sit mostly idle when users are offline, yet become unavailable when many users are online at once, making for a poor service.

In contrast, setting up a “multi-use” model to serve several different tasks, including Text-to-SQL, and using in-context learning to specialize the model for each task could be a more efficient solution. With this approach, there may be more flexibility to invest in hardware and better manage workload demand.

Research has already been conducted in this direction, as outlined in this paper. It will be exciting to see how this approach advances as the baseline capabilities of GPT and other large language models continue to grow.

Is Text-to-SQL Production Ready?

From this evaluation exercise, it is clear that the GPT model can write complex SQL queries, but with some chance of error, which makes it rather human-like. The model can also get confused by unclear prompts, much as humans sometimes misunderstand instructions. Perhaps the model is better suited to writing simple queries. Regardless, it cannot achieve 100% accuracy, because natural language is inherently imprecise, while SQL is a precise specification. Moreover, when a question is extremely complex, it becomes difficult to specify it precisely in natural language, so we should not expect a Text-to-SQL system to guess our latent intent correctly every time. However, this does not mean that we should give up.

One way to view the situation is that we already have very good Text-to-SQL generation for single-table filter and projection queries. These queries are also easy for humans to verify. If the scenario is retrieving relevant rows from a table (and cross-referencing them with other pieces of information to produce a natural language answer to a question, as in the case of LlamaIndex), the result should include the generated SQL query so the user can perform a simple follow-up validation.

Since the user still needs to validate each query, SQL will remain a necessary tool. However, can we make it easier for people to write SQL? Copilot is excellent for Python, but writing correct SQL requires access to the underlying schema, constraints, and user-defined objects. The ideal application scenario should involve generating useful templates and pieces that are easy to verify and incorporate into a workflow that makes the user more productive.
