In my last post on Text-to-SQL with GPT-3.5, I pointed out a problem with the existing benchmark's evaluation metric: it rejects perfectly valid SQL queries simply because their execution results do not exactly match those of the golden (i.e., labeled) queries. In this post, I discuss a new metric I created to better evaluate…
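To make the failure mode concrete, here is a minimal sketch of how a strict execution-match check can reject a semantically acceptable prediction. The `singer` table, the two queries, and the comparison logic are all my own hypothetical illustration, not the benchmark's actual implementation:

```python
import sqlite3

# Hypothetical toy database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [(1, "Ann", 30), (2, "Bob", 25)],
)

gold = "SELECT name FROM singer WHERE age > 26"
pred = "SELECT name, age FROM singer WHERE age > 26"  # adds a harmless extra column

gold_rows = conn.execute(gold).fetchall()
pred_rows = conn.execute(pred).fetchall()

# Strict comparison: any difference in the result sets counts as a failure,
# even though the predicted query answers the question correctly.
print(gold_rows == pred_rows)  # False -> the prediction is marked wrong
```

Here the predicted query returns the right singers but includes an extra `age` column, so an exact result comparison marks it incorrect even though a human would accept the answer.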