Wenbo Xu (Harbin Institute of Technology, Shenzhen), wenboxu707@gmail.com; Liang Yan (Harbin Institute of Technology, Shenzhen; Inspur Cloud Information Technology Co., Ltd), yanlianginspurhit@gmail.com; Peiyi Han (Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory), hanpeiyi@hit.edu.cn; Haifeng Zhu (Harbin Institute of Technology, Shenzhen), zhuhaifeng@stu.hit.edu.cn; Chuanyi Liu (Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory), liu_chuan_yi@126.com; Shaoming Duan (Peng Cheng Laboratory), duanshm@pcl.ac.cn; Cuiyun Gao (Harbin Institute of Technology, Shenzhen), gaocuiyun@hit.edu.cn; and Yingwei Liang (Guangdong Power Grid Co., Ltd), liangyingwei@gdxx.csg.cn
Abstract.
Large Language Model-based (LLM-based) Text-to-SQL methods have achieved important progress in generating SQL queries for real-world applications. However, when confronted with table content-aware questions in real-world scenarios, ambiguous data content keywords and non-existent database schema column names within the question lead to poor performance of existing methods. To solve this problem, we propose a novel approach, Table Content-aware Text-to-SQL with Self-Retrieval (TCSR-SQL). It leverages the LLM's in-context learning capability to extract data content keywords from the question and infer the possibly related database schema, which is used to generate Seed SQL queries that fuzzily search the database. The search results are further used to confirm the encoding knowledge with the designed encoding knowledge table, including the column names and exact stored content values used in the SQL. The encoding knowledge is then used to obtain the final Precise SQL through a multi-round generation-execution-revision process. To validate our approach, we introduce a table-content-aware, question-related benchmark dataset containing 1,692 question-SQL pairs. Comprehensive experiments conducted on this benchmark demonstrate the remarkable performance of TCSR-SQL, achieving an improvement of at least 13.7% in execution accuracy compared to other state-of-the-art methods.
1. Introduction
Text-to-SQL (NL2SQL) is an academic research field that aims to transform natural language questions into corresponding Structured Query Language (SQL) queries (Zhong et al., 2017; Wang et al., 2019). It helps people without programming skills to retrieve data relevant to a question directly from databases, and has therefore attracted great interest and rapid development. Recently, the Large Language Model (LLM) technique has developed rapidly and demonstrated its ability to distill knowledge from massive amounts of data in various domains (Wu et al., 2023; Han et al., 2023; Zheng et al., 2023; Sun, 2023) and to complete tasks related to data analysis (Ding et al., 2022; Chiang and Lee, 2023; Shen et al., 2023; Cheng et al., 2023). The latest Text-to-SQL research also concentrates on designing LLM-based methods that use the database schema or data contents. Some representative methods employ solely the database schema, including C3 (Dong et al., 2023), DIN-SQL (Pourreza and Rafiei, 2024a), DAIL-SQL (Gao et al., 2023), and Fin-SQL (Zhang et al., 2024). Other methods employ both data contents and database schema, such as SQLPrompt (Sun et al., 2023b), CodeS (Li et al., 2024b), and MAC-SQL (Wang et al., 2023).
However, these studies still struggle with a problem common in real-world scenarios: how to generate SQL queries corresponding to table content-aware natural language questions? As shown in Figure 1, there are two difficulties in solving this problem. First, the column names used by the gold SQL query generally do not appear among the data content keywords within the question. In the example, the data content keywords appearing in the question ('the first quarter of 2023', 'GDP growth rate') are not the same as the column names used in the gold SQL query ('reportperiod', 'indexname', 'cumulative'). Thus existing methods cannot directly use the data content keywords within the question as column names, which may lead to errors in the generated SQL query. Second, the condition values used in the gold SQL query are not the same as the data content keywords within the question. In the example, the gold SQL query uses the condition (roworder=5) to match the data content keyword 'GDP growth rate' within the question, because the keyword matches multiple values stored in the 'indexname' column ('GDP growth(%)', 'GDP growth rate', etc.). The gold SQL query therefore relies on database-specific encoding knowledge with documented correlations (the condition "roworder=5" corresponds to the condition "indexname LIKE '%GDP growth%'"). However, the natural language question does not specify this correlation, so existing methods can neither confirm the consistency between the data content keywords within the question and the content values stored in the database, nor apply the encoding knowledge. As a result, incorrect condition values may be used when generating SQL queries.
Based on the preceding discussion, there are three technical challenges for LLM-based Text-to-SQL methods in solving the above problem. First, it is arduous to input the entire database contents into LLMs. Real-world industrial databases usually store huge amounts of data. Due to the limited LLM input context, embedding all database contents and sending them directly to the LLM would exceed the maximal input token length and is computationally infeasible; using LLMs with a larger input token length may result in high costs. Methods that only add data content samples (Wang et al., 2023; Li et al., 2024b) cannot provide definitive results about the stored content values used by the SQL. Second, interference from noisy data unrelated to the question tends to degrade model performance. Real-world databases usually contain numerous tables, rows, and columns, but only a small percentage of them are highly relevant to the question. Since the output of an LLM is directly affected by its input, feeding a large amount of irrelevant data into the LLM may impair its understanding of the question and lead to errors in the output (Patnaik et al., 2024; Kumar et al., 2023). Third, understanding all the data contents of a database is challenging. In practical database ecosystems, it is prevalent to find auxiliary data incorporated to facilitate maintenance tasks, such as an encoding knowledge table that maintains the alignment between domain-specific indexes and their corresponding encoded data columns. Inputting raw data from the database to the LLM without prior processing may result in the loss of important prior knowledge, leading to generated SQL queries that fail to make use of such encoding knowledge. In summary, we need a more effective LLM-based Text-to-SQL method that satisfies three requirements: the ability to handle large databases with massive amounts of data, robustness to noisy data, and the ability to understand and utilize holistic database contents.
To address the aforementioned challenges, we propose TCSR-SQL, an approach towards Table Content-aware Text-to-SQL with Self-Retrieval. The design of two self-retrieval techniques, based on fuzzy testing and Retrieval Augmented Generation (RAG) (Lewis et al., 2020), enables TCSR-SQL to effectively obtain the database schema column names and exact stored content values, thus solving the above problem. Figure 2 illustrates the overall framework of TCSR-SQL, which comprises three modules: Keywords Extraction & Fuzzy Detection (Section 3.1), Knowledge Retrieval & Alignment (Section 3.2), and SQL Generation & Revision (Section 3.3). The Keywords Extraction & Fuzzy Detection module extracts the data content keywords from the question, speculates the related database schema, and searches the database for possibly stored data content values. This module implements self-retrieval based on fuzzy testing, thereby eliminating the need to input the complete data contents: it generates a series of Seed SQL queries to identify any possible data contents in the specified columns, which helps estimate the exact content values stored in the database corresponding to each data content keyword within the question. The Knowledge Retrieval & Alignment module confirms the columns and values of the conditions used in the generated SQL queries based on the results of the first module. This module incorporates self-retrieval with the designed encoding knowledge table using RAG, thereby eliminating the impact of noisy data that is irrelevant to the question. Specifically, it incorporates both the Seed SQL search results and the encoding knowledge table for condition confirmation, which significantly reduces the data scope to only the database schema and data contents relevant to the question. The SQL Generation & Revision module generates a preliminary version of the SQL query (Fuzzy SQL) and iteratively modifies it to obtain the final result (Precise SQL). This module employs the findings of self-retrieval in the self-debugging process, thus enhancing the understanding of the database contents relevant to the question. The iterative generation-execution-revision process uses the SQL execution results and the encoding knowledge obtained in the second module, facilitating an enhanced comprehension of the question, database schema, and data contents.
We create a Table Content-aware question-related benchmark test Dataset (TCD) of 1,692 question-SQL pairs to evaluate our method. This dataset is derived from multiple Spider variant datasets and real-world databases. The execution accuracy and exact-set-match accuracy of our method reach 75.0% and 48.2%, an improvement of at least 13.7% and at most 37.4%, respectively, over existing LLM-based Text-to-SQL SOTA methods. By analyzing the comparison results on representative test cases, we show that our method can identify the exact column names and stored content values used in SQL queries from the database, and that multiple rounds of generation-execution-revision interactions between the LLM and the database further improve the accuracy of the generated SQL queries.
Our contributions can be summarized as follows:
- 1.
We propose TCSR-SQL, which incorporates self-retrieval techniques to solve the widespread problem of generating SQL queries corresponding to table content-aware natural language questions from databases with huge amounts of data.
- 2.
We create a novel benchmarking dataset called TCD which includes 1,692 question-SQL pairs to evaluate the effectiveness of solving the above problem.
- 3.
We validate the effectiveness of our method on the constructed test dataset. Experimental results demonstrate that our method improves execution accuracy by at least 13.7% compared to existing Text-to-SQL SOTA methods.
2. Related Work
The latest Text-to-SQL studies pay close attention to utilizing Large Language Models (LLMs) for Text-to-SQL tasks (Rajkumar et al., 2022; Liu et al., 2023). Aiming to surpass traditional Text-to-SQL methods and further improve performance, existing LLM-based Text-to-SQL methods can be classified into two categories: database schema-based methods and data content-based methods.
2.1. Database Schema-based Text-to-SQL Methods
In the initial period of LLM-based Text-to-SQL methods, research efforts primarily focus on using only database schema and developing basic NLP techniques to fully exploit the capabilities of LLMs for this task. According to the type of techniques, these methods can be categorized into two groups.
Prompt-based methods. The first representative technique is prompt engineering, including least-to-most prompting, zero-shot prompting, few-shot/compound prompting, chain-of-thought (CoT) prompting, and self-debugging. For example, C3 (Dong et al., 2023) designs a zero-shot prompting technique by bifurcating the Text-to-SQL task into two sub-tasks: schema linking and SQL generation. For the schema linking task, C3 designs clear prompting and calibration bias prompting; the output SQL query is then selected with the highest confidence level through self-consistency (Wang et al., 2022). DIN-SQL (Pourreza and Rafiei, 2024a) designs a multi-round few-shot prompting strategy by decomposing the Text-to-SQL task into four subtasks: schema linking, classification and decomposition, SQL generation, and self-correction. It integrates the least-to-most prompting, chain-of-thought (CoT) prompting, and self-debugging techniques into the few-shot prompting technique, thereby elevating the Text-to-SQL performance of LLMs. DAIL-SQL (Gao et al., 2023) analyzes and integrates the advantages of existing prompt-based Text-to-SQL methods from three perspectives: question representation, example selection, and example organization. It chooses the code representation prompt (CR) in the question representation stage, masked question similarity selection (MQS) and query similarity selection (QRS) in the example selection stage, and SQL-only organization (SO) in the example organization stage. ACT-SQL (Zhang et al., 2023) provides a method to automatically generate auto-CoT Text-to-SQL exemplars, thereby eliminating the need for manual labeling. On the premise of providing the database schema only, some recent studies focus on combining various prompt techniques for specific problems in SQL generation (Nan et al., 2023; Domínguez et al., 2024; Xie et al., 2024; Ren et al., 2024).
However, the above methods face several limitations. Limited by the length of LLM input tokens and the database schema-only input, it’s difficult to retrieve the full contents of a large real-world database. This can lead to errors in the database storage values used in the conditions of the generated SQL, as well as incorrect database column names used in the generated SQL.
Fine tuning-based methods. With the rapid iteration of LLM techniques and the large-scale emergence of open-source LLMs, several studies are devoted to turning general LLMs into Text-to-SQL domain-specific LLMs through fine-tuning. As an initial attempt, SQL-PALM (Sun et al., 2023a) and DAIL-SQL (Gao et al., 2023) utilize the Spider training set (Yu et al., 2018) to explore the fine-tuning effectiveness of the closed-source PaLM-2 (Anil et al., 2023) and several open-source LLMs, respectively. Fin-SQL (Zhang et al., 2024) uses LoRA (Hu et al., 2021), a common technique for LLM fine-tuning, to design a LoRA-based multi-task PEFT method. It fine-tunes different open-source LLMs, including LLaMA-2 (Touvron et al., 2023) and Baichuan-2 (Yang et al., 2023), to complete the Text-to-SQL task for financial domain databases. DTS-SQL (Pourreza and Rafiei, 2024b) proposes a two-stage fine-tuning approach for data privacy protection issues using open-source LLMs with a small number of parameters. By decoupling the LLM fine-tuning task into schema linking fine-tuning and SQL generation fine-tuning, it improves the performance of fine-tuned LLMs to a level close to existing prompt-based Text-to-SQL methods that use closed-source LLMs such as GPT-4 (OpenAI, 2023b).
These fine tuning-based methods have three limitations. First, most of them use the training set of a specific benchmark to fine-tune LLMs, which results in poor robustness when the fine-tuned LLM is migrated to Text-to-SQL tasks on other, unrelated data. Second, the Text-to-SQL domain-specific fine-tuning process implicitly teaches the LLM to capture the correlations between the fine-tuning data, questions, and SQL queries; accuracy may therefore drop significantly on questions that require explicit database-associated encoding knowledge. Third, these methods still take only the database schema as input, which may lead to incorrect condition values in the generated SQL.
2.2. Data Content-based Text-to-SQL Methods
In real-world Text-to-SQL applications, researchers have observed that manually writing SQL queries usually involves querying and inspecting the data content within the database. Therefore, recent LLM-based Text-to-SQL methods also consider adding data contents to further improve task performance.
LLM-only methods. Some of them directly add several data content samples to the prompt-based and fine tuning-based methods mentioned above. For prompt-based methods, SQLPrompt (Sun et al., 2023b) adds relevant database content values to the input prompts on top of SQL-PALM (Sun et al., 2023a). DFIN-SQL (Volvovsky et al., 2024) enhances the schema linking module of the original DIN-SQL (Pourreza and Rafiei, 2024a) by adding sample rows with data contents. PET-SQL (Li et al., 2024a) utilizes data content samples in its schema linking and final SQL generation prompts, and further introduces diverse LLMs to facilitate cross consistency and fine-grained voting. For fine tuning-based methods, CodeS (Li et al., 2024b) incorporates data content samples into the designed database prompt construction session to fine-tune the open-source LLM StarCoder (Li et al., 2023) and obtain a Text-to-SQL domain-specific LLM.
However, these methods share a significant problem. The data contents provided to the LLM are mixed with other related information within the prompt, and without feedback it is uncertain whether the LLM understands and applies them. Furthermore, it is unclear how the data contents help improve the accuracy of SQL generation.
Agent-based methods. Recently, the LLM-based agent technique has attracted considerable attention due to its ability to introduce additional information that significantly enhances the performance of a single LLM across various tasks, and several studies explore agent-based Text-to-SQL methods. For example, Guo et al. (Guo et al., 2023) and SQL-CRAFT (Xia et al., 2024) both incorporate feedback on the execution of SQL queries from the database into the prompt; they design a dynamic revision chain and an automated loop control algorithm, respectively, to guide LLMs to refine their generated SQL queries. Knowledge-to-SQL (Hong et al., 2024) and Dubo-SQL (Thorpe et al., 2024) generate SQL queries based on the idea of Retrieval Augmented Generation (RAG) (Lewis et al., 2020). The former introduces a trained Data Expert LLM (DELLM), which is trained using the gold knowledge provided in the BIRD benchmark (Liu et al., 2022); the DELLM generates additional knowledge and provides it to LLMs, thereby improving the accuracy of SQL generation. The latter introduces diverse RAG by comparing vector embeddings of questions and selecting few-shot examples from the BIRD training set. MAC-SQL (Wang et al., 2023) designs three LLM-based agents (i.e., selector, decomposer, and refiner) and uses a multi-agent collaboration framework to address challenges in Text-to-SQL tasks.
Although these methods provide data content samples, they are still unable to directly use the exact content values stored in the database as accurate condition values when generating SQL queries. Therefore, we propose self-retrieval to integrate database content comprehension with the SQL generation-execution-revision process. This approach addresses the real-world challenge of accurately generating SQL queries that use both the exact stored content values and the corresponding database schema column names.
3. TCSR-SQL
As mentioned in the Introduction, existing LLM-based Text-to-SQL methods face two difficulties in accurately generating SQL queries for table content-aware natural language questions over a large database. The first difficulty is identifying the database schema column names used by the SQL query based on the data content keywords within the question. The second difficulty is identifying the exact content values stored in the database used by the SQL query, based on the inaccurate data content keywords within the question. As depicted in Figure 2, we propose TCSR-SQL to overcome these difficulties, which consists of three modules:
- 1.
Keywords Extraction & Fuzzy Detection. This module takes a natural language question, the database schema, and database content samples as input. It extracts the data content keywords from the question together with all possibly related columns in the database, which are then sent to the designed fuzzy detection session to search for potentially relevant content values stored in the database.
- 2.
Knowledge Retrieval & Alignment. This module takes the output of the first module (i.e., data content keywords and search results for potentially relevant content values) as input. It tries to identify the exact stored content values corresponding to the data content keywords according to the search results. For unascertainable cases, it matches alternative column names and stored content values using the designed encoding knowledge table. The final output is the encoding knowledge for each data content keyword in the original question, including both the column names and the exact content values stored in the database.
- 3.
SQL Generation & Revision. This module takes the original input of the first module and the encoding knowledge from the second module as input. It first generates the Fuzzy SQL based on the question, database schema, and database content samples. The Fuzzy SQL is then modified based on the database column names and exact stored content values provided in the encoding knowledge. The output is the Precise SQL, achieved by executing the SQL and revising incorrect Fuzzy SQL according to the execution feedback.
3.1. Keywords Extraction & Fuzzy Detection
Through analysis of the schema linking sessions of C3 (Dong et al., 2023) and DIN-SQL (Pourreza and Rafiei, 2024a), we find that existing schema linking methods directly identify references to the database schema and data content keywords in natural language questions, which helps synthesize complex SQL queries across multiple tables from different domains. However, these methods cannot directly retrieve the relevant database schema for SQL generation when the original question contains only inaccurate data content keywords. Under such circumstances, it becomes more difficult to directly find the usable column names and the corresponding exact content values stored in large databases. The schema detection problem proposed by Liu et al. (Liu et al., 2022) in their Table Question Answering (TableQA) benchmark is very similar to the first difficulty, which is the focus of this paper. Inspired by it, we propose the Keywords Extraction & Fuzzy Detection module in this section: based on the schema detection problem, we design a self-retrieval technique that combines the idea of fuzz testing from the computer security field with the traditional schema linking method above.
Specifically, the first objective of the Keywords Extraction & Fuzzy Detection module is to identify the database schema used in the generated SQL query based on the inaccurate data content keywords extracted from the natural language question. Another objective of this module is to determine, as far as possible, the content values stored in the database that are associated with the data content keywords. The technical challenges lie in two aspects. First, it is impossible to input the complete table contents of a large database into an LLM due to the upper limit of the input context token length. Second, generating a SQL query for a natural language question requires only a very small proportion of the data stored in the database, resulting in interference from other noisy data unrelated to the question.
Algorithm 1 shows the overall workflow of the Keywords Extraction & Fuzzy Detection module. First, we design a few-shot prompt-based method for keyword extraction. The input prompt consists of a natural language question, the database schema, and four randomly selected data samples for each table in the database. The LLM is then used to extract the data content keywords within the natural language question, leveraging its in-context semantic learning capability. Based on the extracted data content keywords and the provided table samples, the LLM further confirms the required database schema (i.e., table names and column names). It should be emphasized that we require the LLM to identify a unique table name and several associated column names for each keyword. The purpose of this approach is to maximize the probability of finding the stored content values associated with the inaccurate data content keyword within the question, while accurately assigning the keyword to the related data table at the same time. It allows for the verification of the column names and content values stored in the database that are required for generating the final SQL query. Figure 2 illustrates an example and the full prompt can be found in Appendix A.1.
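A minimal sketch of this keyword extraction step is given below. It assumes a generic chat-completion helper llm_chat and a JSON answer format; these helper names and the output shape are illustrative assumptions, not the exact implementation, and the real prompt follows the template in Appendix A.1.

import json
import random

def sample_rows(conn, table, k=4):
    # Randomly pick k rows from a table to serve as data content samples.
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return random.sample(rows, min(k, len(rows)))

def extract_keywords(llm_chat, question, schema_str, samples_str):
    # Ask the LLM for data content keywords and, for each keyword, a unique
    # table plus the candidate columns that may store the corresponding values.
    prompt = (
        "### First, find schema keywords and data content keywords in the given question. "
        "Then, for each keyword, try to identify the possible corresponding tables and "
        "columns using the value examples provided.\n"
        f"### Tables with value examples:\n{schema_str}\n{samples_str}\n"
        f"### Question:\n{question}\n"
        "### Answer (JSON list of objects with keyword, table, columns):"
    )
    return json.loads(llm_chat(prompt))  # assumed to return the JSON answer string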
Based on the data content keywords extracted from the question and the corresponding database table names and column names, we design a few-shot prompt-based method for fuzzy detection, which innovatively combines the idea of dynamic fuzz testing with the Text-to-SQL task. In the keywords extraction session above, we mentioned the goal of maximizing the probability of finding stored content values associated with the inaccurate data content keyword within the question. The fuzzy detection session therefore generates a series of SQL queries (i.e., Seed SQL) to search the specified columns for data content values related to the data content keywords. The resulting search sub-tables are sent to the Knowledge Retrieval & Alignment module to identify the exact stored content value corresponding to the inaccurate data content keyword in the question. Following a typical mutation-based fuzzing workflow (Serebryany, 2016), we introduce the LLM as a tool for generating a continuous stream of diverse Seed SQL queries to facilitate the fuzzy detection session:
- 1.
Type-aware mutation. Inspired by Liu et al. (Liu et al., 2024), we divide the Seed SQL into three types of mutually independent components (i.e., seed types) in this process. For each component (i.e., seed), we initialize a seed pool and bootstrap the Seed SQL generation pipeline. As shown in Table 1, a single Seed SQL has three types of mutable components: column names, data contents, and SQL skeletons. For column names, we set all possibly related database column names obtained in the preceding keywords extraction session as the seed pool. For data contents, we introduce the LLM to predict synonyms of the data content keywords within the original question; the input prompt includes the data content keywords and data samples from the related columns, and the synonyms suggested by the LLM form the seed pool. For SQL skeletons, we set the following two candidate SQL query templates as the seed pool:
SQL1: SELECT DISTINCT {CN} FROM {TN} WHERE {CN} = {DC};
SQL2: SELECT DISTINCT {CN} FROM {TN} WHERE {CN} LIKE '%{DC}%';
Figure 2 illustrates several examples of the Seed SQL and the full prompt can be found in Appendix A.2.
- 2.
Seed SQL generation & execution. For each data content keyword, a seed is randomly selected from each seed pool and mutated into a new Seed SQL. The new Seed SQL is then executed against the database and returns its search results. We repeat the whole generation process until all seeds have been selected (a minimal sketch of this loop is given after Table 1).
Component Type | Mutation (seed pool)
Column Name | All possibly related column names obtained in the keywords extraction session
Data Content | Synonyms of the data content keyword suggested by the LLM
SQL Skeleton | The candidate SQL query templates SQL1 and SQL2
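As a rough illustration of the type-aware mutation and Seed SQL generation-execution loop described above, the following sketch enumerates (column name, data content, skeleton) seeds and keeps the non-empty search results; the helper llm_synonyms stands in for the LLM call that proposes data content synonyms and is an assumption rather than part of the released implementation.

import itertools
import sqlite3

SKELETONS = [
    "SELECT DISTINCT {cn} FROM {tn} WHERE {cn} = '{dc}'",
    "SELECT DISTINCT {cn} FROM {tn} WHERE {cn} LIKE '%{dc}%'",
]

def fuzzy_detect(conn, table, columns, keyword, llm_synonyms):
    # Mutate (column, data content, skeleton) seeds into Seed SQL queries and
    # collect the non-empty search results for one data content keyword.
    data_pool = [keyword] + llm_synonyms(keyword, columns)  # LLM-proposed synonyms
    results = []
    for cn, dc, skeleton in itertools.product(columns, data_pool, SKELETONS):
        seed_sql = skeleton.format(cn=cn, tn=table, dc=dc.replace("'", "''"))
        try:
            rows = conn.execute(seed_sql).fetchall()
        except sqlite3.Error:
            continue  # malformed seed, skip it
        if rows:
            results.append({"column": cn, "values": [r[0] for r in rows]})
    return results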
3.2. Knowledge Retrieval & Alignment
Through the processing of the Keywords Extraction & Fuzzy Detection module above, our method obtains a series of search results for each inaccurate data content keyword within the question, including table names and column names from the database schema and content values stored in the database. However, these search results cannot be used directly to identify the column names and the corresponding stored content values when determining the conditions of the final SQL query, so this supplementary information is not yet fully exploited to improve generation accuracy. Since the Text-to-SQL task is a knowledge-intensive NLP task, the data contents in the database associated with the question need to be considered and used in a more comprehensive way. Therefore, we propose the Knowledge Retrieval & Alignment module in this section. Based on the existing idea of Retrieval Augmented Generation (RAG) (Lewis et al., 2020), we design a self-retrieval technique that integrates these additional search results to help identify which column names and exact stored content values should be used when generating the SQL query for each data content keyword within the original question.
Specifically, the objective of the Knowledge Retrieval & Alignment module is to identify the exact content value stored in the database by using the database’s value search results of related data content keywords provided in the above Keywords Extraction & Fuzzy Detection module.The technical challenge is that it is difficult for LLMs to fully comprehend all the contents of the database and the database-specific encoding knowledge, leading to multiple errors when generating the final SQL query.
Algorithm 2 shows the overall workflow of the Knowledge Retrieval & Alignment module. First of all, we start the knowledge retrieval session by categorizing the search results output by the fuzzy detection session:
- 1.
Case 1. The search results indicate that there is a unique content value stored in the associated column that exactly corresponds to the data content keyword within the question. We normalize the search result into encoding knowledge and add it to the encoding knowledge table as candidate knowledge for generating the final SQL query. The normalization produces a JSON record containing the data content keyword within the original question, the associated database name, the associated table name, the associated column name, and the exact stored content value; this record is then merged into the encoding knowledge table under the corresponding data content keyword.
- 2.
Case 2. The search results indicate that there are multiple content values stored in the associated column corresponding to the data content keyword within the question, or the exact content value cannot be found in the currently associated column. Therefore, we consider the current search results invalid and discard them directly.
It is important to note that we design an innovative RAG-based encoding knowledge table and use it in both sessions of the Knowledge Retrieval & Alignment module. Specifically, the encoding knowledge table consists of five columns: the data content keyword within the question, the associated database name, the associated table name, the associated column name, and the exact stored content value. It mainly contains two types of knowledge. The first type is derived from the relationship-matching table existing in the original database, which records the correspondence between domain synonyms and the specific encoding columns stored in the database. For example, in Figure 2, a data content keyword within the question may have multiple synonyms stored in the related column of the database; in this situation the gold SQL query has to apply the knowledge recorded in the relationship-matching table. The second type of knowledge is the candidate knowledge supplemented by the knowledge retrieval session. For example, in Figure 2, a data content keyword within the question corresponds to a unique content value stored in the related column of the database; after normalization, it is added to the encoding knowledge table as new candidate knowledge. Introducing the encoding knowledge table has two advantages. First, it contains the encoding knowledge used in the database construction process, which helps the LLM understand the database content more comprehensively and improves the accuracy of the generated SQL queries. Second, the encoding knowledge table is dynamic: the continuous updating of knowledge during the knowledge retrieval session helps the LLM gain a better understanding of the data content of a specific database and further improves generation accuracy.
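A small sketch of how the Case 1 results above might be normalized into encoding-knowledge records and merged into the table; the field names are illustrative assumptions that mirror the five columns listed above.

def normalize_to_knowledge(keyword, db_name, table, search_results):
    # Keep only Case 1 results (a single exact stored value in a column) and
    # normalize them into encoding-knowledge records; Case 2 results are discarded.
    records = []
    for res in search_results:
        if len(res["values"]) == 1:  # Case 1: unique matching stored value
            records.append({
                "keyword": keyword,            # data content keyword in the question
                "database": db_name,
                "table": table,
                "column": res["column"],
                "stored_value": res["values"][0],
            })
    return records

def merge_knowledge(knowledge_table, records):
    # The encoding knowledge table is dynamic: new candidate knowledge is appended
    # so that later questions on the same database can reuse it.
    knowledge_table.extend(records)
    return knowledge_table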
Based on the updated encoding knowledge table output by the knowledge retrieval session, we develop the knowledge alignment session, aiming at further understanding the database content related to the inaccurate data content keywords within the natural language question. Specifically, we use the embedding model to compute the embedding representation of each data content keyword within the question and of each piece of knowledge in the encoding knowledge table:

(1)   $e_{k_i} = \mathrm{Emb}(k_i), \quad e_{g_j} = \mathrm{Emb}(g_j)$

where $k_i$ denotes the $i$-th data content keyword extracted from the natural language question, and $g_j$ denotes the $j$-th knowledge in the encoding knowledge table. Next, we calculate the cosine similarity between the embedding of the data content keyword and the embedding of the encoding knowledge:

(2)   $\mathrm{sim}(k_i, g_j) = \dfrac{e_{k_i} \cdot e_{g_j}}{\lVert e_{k_i} \rVert \, \lVert e_{g_j} \rVert}$

Then, we choose the encoding knowledge with the highest cosine similarity score for each data content keyword within the question. The chosen knowledge contains the column name from the database schema and the exact content value stored in the database, which serve as the available knowledge for generating the final SQL query. Finally, all of the chosen encoding knowledge is sent to the next SQL generation module to modify the Fuzzy SQL and produce the final Precise SQL.
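A brief sketch of this alignment step, assuming a generic embed function in place of the Dmeta-embedding model used in the experiments; the text used to embed each knowledge record is likewise an assumption.

import numpy as np

def align_keywords(keywords, knowledge_table, embed):
    # For each keyword, pick the encoding-knowledge record whose embedding has the
    # highest cosine similarity with the keyword embedding (Equations (1) and (2)).
    texts = ["{} {} {}".format(k["keyword"], k["column"], k["stored_value"])
             for k in knowledge_table]
    kn_vecs = np.array([embed(t) for t in texts])
    kn_norms = np.linalg.norm(kn_vecs, axis=1)
    aligned = {}
    for kw in keywords:
        v = np.asarray(embed(kw))
        sims = kn_vecs @ v / (kn_norms * np.linalg.norm(v) + 1e-9)
        aligned[kw] = knowledge_table[int(np.argmax(sims))]
    return aligned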
3.3. SQL Generation & Revision
As described in the Knowledge Retrieval & Alignment module, we obtain encoding knowledge, including the column names and exact stored content values corresponding to the data content keywords within the question, that can be used to generate the final SQL query. Furthermore, self-correction techniques (Pourreza and Rafiei, 2024a; Wang et al., 2023; Chen et al., 2023) are essential for checking and correcting the generated answers, especially in the context of Text-to-SQL tasks (Wang et al., 2023). Based on the generation-execution-revision framework, we combine the obtained encoding knowledge with existing self-correction techniques and propose the SQL Generation & Revision module.
Specifically, the objective of the SQL Generation & Revision module is to generate SQL queries using the encoding knowledge and automatically rectify them based on error feedback. The overall workflow is as follows. First, we generate a preliminary SQL query, called the Fuzzy SQL, based on the natural language question, the database schema, and the database content samples. We call it Fuzzy SQL because, although the LLM generates it after fully understanding the question from the input prompt and it has a complete SQL skeleton, its column names and data content values are derived directly from the original natural language question and therefore remain uncertain and subject to modification. Next, we introduce the LLM to revise the Fuzzy SQL using the encoding knowledge, which includes the column names and precise content values corresponding to the data content keywords within the question. The revised Fuzzy SQL is then executed to evaluate its correctness, feasibility, and ability to retrieve non-empty results from the database. If the evaluation passes, the result is output as the final Precise SQL; otherwise, it is further revised and re-evaluated. This process is repeated until the result is correct or the number of revisions reaches the upper limit. Module 3 in Figure 2 illustrates an example and the full prompts can be found in Appendix A.3 and A.4.
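The following is a minimal sketch of this generation-execution-revision loop under the same assumptions as before: llm_chat is a generic chat-completion helper, and the prompt strings abbreviate the templates in Appendix A.3 and A.4.

import sqlite3

def generate_and_revise(conn, llm_chat, question, schema_str, knowledge_str, max_rounds=3):
    # Generate a Fuzzy SQL, revise it with the encoding knowledge, then repeat an
    # execute-and-fix loop until the query returns rows or the revision budget is spent.
    fuzzy_sql = llm_chat(f"Write a SQLite query for the question.\n"
                         f"Question: {question}\nSchema:\n{schema_str}")
    sql = llm_chat(f"Revise this SQL using the encoding knowledge below.\n"
                   f"SQL: {fuzzy_sql}\nEncoding knowledge:\n{knowledge_str}")
    for _ in range(max_rounds):
        try:
            rows = conn.execute(sql).fetchall()
            if rows:               # executable and non-empty: accept as Precise SQL
                return sql
            feedback = "The query executed but returned an empty result."
        except sqlite3.Error as err:
            feedback = f"SQLite error: {err}"
        sql = llm_chat(f"Fix the SQL based on the feedback.\nSQL: {sql}\n"
                       f"Feedback: {feedback}\nEncoding knowledge:\n{knowledge_str}")
    return sql                     # best effort after the revision limit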
4. Evaluation
4.1. Experiment Setup
4.1.1. Datasets
To evaluate performance on table content-aware questions in real-world scenarios, we create a Table Content-aware question-related benchmark test Dataset (TCD). As shown in Table 2, we select QA pairs related to the real-world difficulties from the following four datasets and remove those with duplicate questions, empty execution results, or semantic errors. In addition, we manually annotate 423 QA pairs based on two real-world business databases, for a total of 1,692 QA pairs.
Spider-DK (Gan et al., 2021a) modifies question-SQL (QA) pairs in the Spider-dev set (Yu et al., 2018) by adding domain knowledge that reflects real-world difficulties. It considers five types of questions involving domain knowledge, three of which are related to the difficulties discussed in the Introduction: T3 requires the model to recognize cell value synonym substitution; T4 requires the model to understand conditions whose encoding corresponds to the question; T5 requires the model to understand ambiguous column names within the question.
Spider-Syn(Gan etal., 2021b) modifies natural language (NL) questions in the Spider benchmark by replacing column names and cell values with manually selected synonyms. It simulates the real-world difficulties that users ask questions without knowing the exact database schema or cell values.
Spider-Realistic(Deng etal., 2021) considers the difficulty that column names mentioned in NL questions are different from those stored in databases. They select a complex subset from the Spider-dev set where there are columns compared against values or used in clauses like ORDER BY and manually modify the questions to remove explicit mention of column names.
Dr.Spider(Chang etal., 2022) designs 17 datasets after perturbation of databases, NL questions, and SQL queries to measure robustness from different perspectives. Two of the datasets are related to the difficulties mentioned in the introduction. In column-value, the column name used in SQL needs to be inferred from the value mentioned in the question. In value-synonym, values mentioned in the question are replaced with synonyms.
Datasets | Original | Selected |
Spider-DK | 535 | 107 |
Spider-Syn | 1034 | 254 |
Spider-Realistic | 508 | 210 |
Dr.Spider-NLQ_column_value | 304 | 259 |
Dr.Spider-NLQ_value_synonym | 506 | 439 |
Real-world | - | 423 |
Sum | - | 1692 |
4.1.2. Evaluation Metrics
Execution Accuracy (EX) compares the execution output of the predicted SQL query to that of the gold SQL query for a question. Since there may be multiple valid SQL queries for a single question, the EX metric matches SQL queries that execute to the same result, regardless of differences in the string representation between the generated SQL and the gold SQL.
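As a rough sketch, EX can be computed by executing both queries against the same database instance and comparing their result multisets; the exact comparison rules of the official evaluation script may differ.

import sqlite3
from collections import Counter

def execution_match(conn, pred_sql, gold_sql):
    # A prediction counts as correct if its execution result equals that of the
    # gold query, regardless of how the two SQL strings differ.
    try:
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False   # unexecutable predictions count as wrong
    gold = conn.execute(gold_sql).fetchall()
    return Counter(pred) == Counter(gold)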
Exact-set-match Accuracy (EM) compares each clause in the predicted SQL to that in the gold SQL. A predicted SQL query is considered correct only if all of its components match the ground truth. Some predicted SQL queries that do not meet the exact-set-match requirement can still be executed to produce correct results, so the EM metric is only used as a reference.
4.1.3. Baselines
We conduct experiments on the constructed test dataset and compare TCSR-SQL with representative baselines of database schema-based and data content-based methods:
C3-SQL(Dong etal., 2023) is a zero-shot Text-to-SQL method consisting of three key components: Clear Prompting (CP), Calibration with Hints (CH), and Consistency Output (CO), which correspond to the model input, model bias and model output respectively.
DIN-SQL(Pourreza and Rafiei, 2024a) decomposes text-to-SQL tasks into smaller sub-tasks and designs different prompts for each sub-task to instruct GPT to complete each sub-task and obtain the final SQL.
DAIL-SQL (Gao et al., 2023) selects few-shot demonstrations based on their skeleton similarities and removes cross-domain knowledge from examples for token efficiency.
MAC-SQL(Wang etal., 2023) is a novel LLM-based multi-agent collaborative framework for Text-to-SQL consisting of a selector, a decomposer, and a refiner.
CodeS (Li et al., 2024b) is a series of open-source, highly accurate LLMs tailored for Text-to-SQL tasks. It overcomes the limitations of closed-source models with superior performance and smaller parameter sizes. We choose CodeS-15b-spider, the model with the most parameters, fine-tuned on the Spider dataset (Yu et al., 2018), as the baseline.
4.1.4. Settings
We use GPT-3.5-turbo (OpenAI, 2023a) in our experiments, with the temperature set to 0. We choose the Dmeta-embedding model (https://huggingface.co/DMetaSoul/Dmeta-embedding) as the embedding model for similarity calculation in the Knowledge Retrieval & Alignment module. Table 3 shows the configuration of our experiments.
Parameters | Values |
GPT Model API | gpt-3.5-turbo-0125 |
Temperature | 0 |
Number of Data Content Samples | 6 |
Examples of Keywords Extraction | 4-shot |
Examples of Fuzzy Detection | 9-shot |
Examples of SQL Generation | 4-shot |
Examples of SQL Revision | 0-shot |
4.2. Overall results
Table 4 shows the EX and EM accuracy of TCSR-SQL compared with all baseline methods; the best results are marked in bold and the second-best are underlined. TCSR-SQL reaches 75.0% EX and 48.2% EM, an improvement of at least 13.7% in EX over the best result of the other SOTA methods. Relying on zero-shot prompting, C3-SQL utilizes execution outcomes to filter for self-consistency, but its ability to infer condition values is poor because it does not check the data contents; thus its EX accuracy is the worst (37.6%) and its EM accuracy is only 26.8%. The EX results of DIN-SQL and DAIL-SQL, which are based on few-shot prompting, are about 40%, while their EM results are both below 40%. Both use Spider-train as the demonstration pool; DAIL-SQL selects more suitable demonstration examples for different questions, whereas DIN-SQL's demonstration examples, although carefully designed, are fixed for Spider. In addition, neither method relies on execution feedback, even though execution results are the direct basis for correctness; DIN-SQL has a self-correction mechanism, but its correction reference does not include execution results. MAC-SQL refers to database values, simplifies the database schema, decomposes the question, and uses execution feedback to correct the generated SQL queries. Although its EM accuracy is the worst, its EX accuracy reaches 55.1%, at least 10% higher than the other methods that do not reference data value samples. This indicates that it understands the semantics of the question well, even though it does not adopt a sentence structure similar to the gold SQL. It is noteworthy that CodeS-15b-spider achieves the highest EM accuracy of 53.5%, probably because of its time-consuming fine-tuning on Spider-train, which leads it to generate similar SQL skeletons. Although it has good inference ability, aided by adding the most relevant values extracted from the data content to the prompt, its EX accuracy remains 13.7% lower than ours. TCSR-SQL better obtains the correct condition values in SQL queries by extracting keywords, detecting the database schema and data contents, utilizing encoding knowledge, and applying feedback correction based on execution results. Therefore, TCSR-SQL achieves significantly higher results than all the baselines.
Methods | Execution (EX) | Exact-set-match (EM) |
TCSR-SQL | 75.0 | 48.2 |
C3-SQL | 37.6 | 26.8 |
DIN-SQL | 40.8 | 35.0 |
DAIL-SQL | 42.3 | 39.9 |
MAC-SQL | 55.1 | 22.1 |
CodeS-15b-spider | 61.3 | 53.5 |
4.3. Ablation Experiments
In this section, we evaluate the impact of the existence of each module, as shown in Table 5.
w/o Keywords Extraction & Fuzzy Detection. Since no keywords are extracted, we directly use the natural language question to search in the knowledge base and add the first three retrieved pieces of knowledge to the prompt as references.
The Keywords Extraction & Fuzzy Detection module finds related tables and columns through keyword extraction. Without this module, EX drops by 7.6%, while EM rises slightly. In this module, the LLM extracts keywords and preliminarily determines whether each keyword relates to the database schema or to data values, and then detects data-value keywords in the corresponding columns. Keywords related to data contents are often the values in SQL conditions, while keywords related to the database schema are often the columns used in SQL conditions or in the selection. Incorrect data values or columns directly lead to incorrect SQL queries. Without this module, the keywords within the question cannot be explicitly linked, further explored, or used to search for knowledge, so the method cannot delve into the database to check the columns and exact stored values.
w/o Knowledge Retrieval & Alignment. We directly use the results of Keywords Extraction & Fuzzy Detection to help generate SQL and skip the Knowledge Retrieval & Alignment module.
The Knowledge Retrieval & Alignment module finds the encoded representation of relevant semantics in the database by retrieving encoding knowledge. Without this module, EX and EM drop by 9.3% and 2.6%, respectively. The encoding knowledge table stores knowledge that cannot be inferred solely from natural language questions, the database schema, or data contents. Encoding knowledge is necessary in the real world, especially in practical business, and the ability to use it is an important condition for LLM-based methods to replace human labor in Text-to-SQL tasks. The lack of this module makes it hard for TCSR-SQL to determine condition values, which proves its necessity.
w/o Revision. After generating SQL, no correction is made through execution feedback and the results are directly output.
The SQL Revision module executes the generated SQL on the database instance and corrects SQL that cannot be executed or returns empty results; SQL that executes and returns results does not need to be corrected. Without this module, EX and EM decrease by 6.4% and 1.8%, respectively. The execution results of a SQL query fall into three types: (1) it executes normally and returns results, (2) it executes but returns an empty result, and (3) it cannot be executed and reports an error. This module collects the error or empty-result information and feeds it back to the LLM for correction until the SQL executes successfully and yields a result, or the maximum number of attempts is reached. Since SQL that cannot be executed or returns no result is wrong, the SQL Revision module helps correct such SQL and improves performance.
Methods | EX | EM |
TCSR-SQL | 75.0 | 48.2 |
w/o Keywords Extraction & Fuzzy Detection | 67.4(↓) | 49.5 |
w/o Knowledge Retrieval & Alignment | 65.7(↓) | 45.6 |
w/o Revision | 68.6(↓) | 46.4 |
To further explore the impact of each module at each step of TCSR-SQL, we start by directly generating SQL and add the three modules in sequence. That is, we evaluate TCSR-SQL without all three modules, with only the Keywords Extraction & Fuzzy Detection module, and without the SQL Revision module. As shown in Figure 3, each module contributes a notable improvement of approximately 8% to EX, while the first and third modules also contribute to EM. This indicates that all three modules are indispensable.
4.4. Case study
To better understand how the three modules in TCSR-SQL help the LLM generate SQL that executes correctly, we demonstrate two example cases selected from the dataset. Previous methods cannot obtain the correct data content values as query conditions because they can neither check the database content nor utilize encoding knowledge. In the first case, the keywords related to data values in the question are not aligned with the values stored in the database. MAC-SQL infers some data values through data samples, but the other methods cannot infer the condition values of the SQL, so the generated SQL returns empty execution results. In the second case, the condition value mentioned in the question also differs from the value stored in the database. Since a small number of data samples cannot cover this data value, MAC-SQL directly uses the keyword within the question as the condition value. In addition, the column name Make where the condition value is located is not mentioned in the question, so relying on the LLM's implicit inference leads to the wrong column, Model, being chosen.
When applying TCSR-SQL in case 1, as shown in Figure 5, the Keywords Extraction & Fuzzy Detection module extracts data content keywords and database schema keywords. Specifically, this module uses the LLM to infer the tables and columns corresponding to the relevant keywords and checks the data contents in the table to obtain the exact stored data values. For keywords whose stored data values cannot be obtained, the Knowledge Retrieval & Alignment module retrieves the corresponding encoding knowledge from the encoding knowledge table. By utilizing the encoding knowledge obtained from these two modules, the LLM generates the SQL query conditions correctly.
Similarly in case 2, as shown in Figure 6, LLM uses the encoding knowledge obtained by the two modules to generate SQL. The difference is that the condition of the generated SQL query is still wrong. The SQL Generation & Revision module obtains feedback that the result is empty by executing the SQL and uses it to regenerate SQL that can be executed correctly.
5. Conclusion
In this paper, we propose TCSR-SQL, an approach towards Table Content-aware Text-to-SQL with Self-Retrieval. TCSR-SQL utilizes self-retrieval to generate SQL queries with both the exact stored content values and the corresponding database schema column names, solving a widespread problem encountered in real-world scenarios. The Keywords Extraction & Fuzzy Detection module infers the possibly related columns and relevant content values stored in the database corresponding to the question. The Knowledge Retrieval & Alignment module identifies the exact column names and stored content values used in the SQL queries. The SQL Generation & Revision module generates the final SQL query according to multi-turn execution feedback on each SQL revision result. Extensive experimental results show the superior performance of TCSR-SQL compared to other SOTA LLM-based Text-to-SQL methods in terms of both metrics. Future work will involve the application of different LLMs to specific tasks and the expansion of the modules to address more issues encountered in Text-to-SQL generation.
Appendix A Prompts
A.1. Keywords Extraction Prompt Template
### First, find schema keywords and data content keywords in the given question. Then, for each keyword, try to identify the possible corresponding tables and columns using the value examples provided.
==========
{4 demonstrations}
==========
### Sqlite SQL tables, with their table names, column names and data value examples:
{desc_str}
### Foreign keys of SQLite tables, used for table joins:
{fk_str}
### Question:
{query}
### Answer:
A.2. Fuzzy Detection Prompt Template
### Based on data samples extracted from the related columns, infer the possible storage values (up to 5) of the data content keyword in the database. Please provide them in the form of a list.
{9 demonstrations}
# data content keyword: {keyword}.
# data samples for column {column}: {datasamples}
# possible storage values:
A.3. SQL Generation Prompt Template
### Refer to the provided table and encoding knowledge, follow the requirements, and use valid SQLite to answer the question.
### Requirements:
# In ‘SELECT <column>‘, just select needed columns in the Question without any unnecessary column or value
# If the same column name appears in multiple tables, specify the table to which it belongs to avoid ambiguity.
==========
{4 demonstrations}
==========
### Sqlite SQL tables, with their table names, column names and data value examples:
{desc_str}
### Foreign keys of SQLite tables, used for table joins:
{fk_str}
### Question:
{query}
### Encoding knowledge for SQL generation:
{related_prompt}
### SQL:
SELECT
A.4. SQL Revision Prompt Template
### When executing SQL below, some errors occurred, please fix up SQL based on query and database info.
### Solve the task step by step if you need to. Using SQL format in the code block, and indicate script type in the code block.
### When you find an answer, verify the answer carefully.
### Notes & Examples:
# In ‘SELECT <column>‘, just select needed columns in the Question without any unnecessary column or value
# Don’t use ‘IN‘, ‘OR‘, ‘LEFT JOIN‘ as it might cause extra results, use ‘JOIN‘, ‘INTERSECT‘, ‘EXCEPT‘ instead
# Use ‘DISTINCT‘, ‘DESC‘, or ‘LIMIT‘ when necessary
# In ‘FROM <table>‘ or ‘JOIN <table>‘, do not include unnecessary table
# In ‘JOIN <table>‘, make sure the Foreign keys used in the SQL is correct
### Question:
{query}
### Encoding knowledge for SQL generation:
{related_prompt}
### Sqlite SQL tables, with their table names, column names and data value examples:
{desc_str}
### Foreign keys of SQLite tables, used for table joins:
{fk_str}
### Old SQL:
{old_sql}
### SQLite error:
{sqlite_error}
### Exception class:
{exception_class}
### Now please fixup old SQL and generate new SQL only and with no explanation.
### correct SQL:
References
- Anil et al. (2023) Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403 (2023).
- Chang et al. (2022) Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, et al. 2022. Dr. Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness. In The Eleventh International Conference on Learning Representations.
- Chen et al. (2023) Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023).
- Cheng et al. (2023) Liying Cheng, Xingxuan Li, and Lidong Bing. 2023. Is GPT-4 a good data analyst? arXiv preprint arXiv:2305.15038 (2023).
- Chiang and Lee (2023) Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937 (2023).
- Deng et al. (2021) Xiang Deng, Ahmed Hassan, Christopher Meek, Oleksandr Polozov, Huan Sun, and Matthew Richardson. 2021. Structure-Grounded Pretraining for Text-to-SQL. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1337–1350.
- Ding et al. (2022) Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Shafiq Joty, Boyang Li, and Lidong Bing. 2022. Is GPT-3 a good data annotator? arXiv preprint arXiv:2212.10450 (2022).
- Domínguez et al. (2024) José Manuel Domínguez, Benjamín Errázuriz, and Patricio Daher. 2024. Blar-SQL: Faster, Stronger, Smaller NL2SQL. arXiv preprint arXiv:2401.02997 (2024).
- Dong et al. (2023) Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. 2023. C3: Zero-shot Text-to-SQL with ChatGPT. arXiv preprint arXiv:2307.07306 (2023).
- Gan et al. (2021b) Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver, John R. Woodward, Jinxia Xie, and Pengsheng Huang. 2021b. Towards Robustness of Text-to-SQL Models against Synonym Substitution. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2505–2515.
- Gan et al. (2021a) Yujian Gan, Xinyun Chen, and Matthew Purver. 2021a. Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8926–8931.
- Gao et al. (2023) Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-SQL empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363 (2023).
- Guo et al. (2023) Chunxi Guo, Zhiliang Tian, Jintao Tang, Shasha Li, Zhihua Wen, Kaixuan Wang, and Ting Wang. 2023. Retrieval-augmented GPT-3.5-based Text-to-SQL framework with sample-aware prompting and dynamic revision chain. In International Conference on Neural Information Processing. Springer, 341–356.
- Han et al. (2023) Tianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K. Bressem. 2023. MedAlpaca – an open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247 (2023).
- Hong et al. (2024) Zijin Hong, Zheng Yuan, Hao Chen, Qinggang Zhang, Feiran Huang, and Xiao Huang. 2024. Knowledge-to-SQL: Enhancing SQL Generation with Data Expert LLM. arXiv preprint arXiv:2402.11517 (2024).
- Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
- Kumar et al. (2023) Vishwajeet Kumar, Yash Gupta, Saneem Chemmengath, Jaydeep Sen, Soumen Chakrabarti, Samarth Bharadwaj, and Feifei Pan. 2023. Multi-row, multi-span distant supervision for Table+Text question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 8080–8094.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
- Li et al. (2024b) Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024b. CodeS: Towards Building Open-source Language Models for Text-to-SQL. arXiv preprint arXiv:2402.16347 (2024).
- Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
- Li et al. (2024a) Zhishuai Li, Xiang Wang, Jingjing Zhao, Sun Yang, Guoqing Du, Xiaoru Hu, Bin Zhang, Yuxiao Ye, Ziyue Li, Rui Zhao, et al. 2024a. PET-SQL: A Prompt-enhanced Two-stage Text-to-SQL Framework with Cross-consistency. arXiv preprint arXiv:2403.09732 (2024).
- Liu et al. (2023) Aiwei Liu, Xuming Hu, Lijie Wen, and Philip S. Yu. 2023. A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability. arXiv preprint arXiv:2303.13547 (2023).
- Liu et al. (2024) Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024).
- Liu et al. (2022) Ruixue Liu, Shaozu Yuan, Aijun Dai, Lei Shen, Tiangang Zhu, Meng Chen, and Xiaodong He. 2022. Few-shot table understanding: A benchmark dataset and pre-training baseline. In Proceedings of the 29th International Conference on Computational Linguistics. 3741–3752.
- Nan et al. (2023) Linyong Nan, Yilun Zhao, Weijin Zou, Narutatsu Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, and Dragomir Radev. 2023. Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies. arXiv preprint arXiv:2305.12586 (2023).
- OpenAI (2023a) OpenAI. 2023a. ChatGPT. https://chat.openai.com/chat. Accessed: 2023-10-01.
- OpenAI (2023b) OpenAI. 2023b. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
- Patnaik et al. (2024) Sohan Patnaik, Heril Changwal, Milan Aggarwal, Sumita Bhatia, Yaman Kumar, and Balaji Krishnamurthy. 2024. CABINET: Content Relevance based Noise Reduction for Table Question Answering. arXiv preprint arXiv:2402.01155 (2024).
- Pourreza and Rafiei (2024a) Mohammadreza Pourreza and Davood Rafiei. 2024a. DIN-SQL: Decomposed in-context learning of Text-to-SQL with self-correction. Advances in Neural Information Processing Systems 36 (2024).
- Pourreza and Rafiei (2024b) Mohammadreza Pourreza and Davood Rafiei. 2024b. DTS-SQL: Decomposed Text-to-SQL with Small Large Language Models. arXiv preprint arXiv:2402.01117 (2024).
- Rajkumar etal. (2022)Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022.Evaluating the text-to-sql capabilities of large language models.arXiv preprint arXiv:2204.00498 (2022).
- Ren etal. (2024)Tonghui Ren, Yuankai Fan, Zhenying He, Ren Huang, Jiaqi Dai, Can Huang, Yinan Jing, Kai Zhang, Yifan Yang, and XSean Wang. 2024.PURPLE: Making a Large Language Model a Better SQL Writer.arXiv preprint arXiv:2403.20014 (2024).
- Serebryany (2016)Kosta Serebryany. 2016.Continuous fuzzing with libfuzzer and addresssanitizer. In 2016 IEEE Cybersecurity Development (SecDev). IEEE, 157–157.
- Shen etal. (2023)Chenhui Shen, Liying Cheng, Yang You, and Lidong Bing. 2023.Are large language models good evaluators for abstractive summarization?arXiv preprint arXiv:2305.13091 (2023).
- Sun etal. (2023a)Ruoxi Sun, SercanO Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, and Tomas Pfister. 2023a.SQL-PaLM: Improved Large Language ModelAdaptation for Text-to-SQL.arXiv preprint arXiv:2306.00739 (2023).
- Sun etal. (2023b)Ruoxi Sun, SercanÖ Arik, Rajarishi Sinha, Hootan Nakhost, Hanjun Dai, Pengcheng Yin, and Tomas Pfister. 2023b.SQLPrompt: In-Context Text-to-SQL with Minimal Labeled Data.arXiv preprint arXiv:2311.02883 (2023).
- Sun (2023)Zhongxiang Sun. 2023.A short survey of viewing large language models in legal aspect.arXiv preprint arXiv:2303.09136 (2023).
- Thorpe etal. (2024)DaytonG. Thorpe, AndrewJ. Duberstein, and IanA. Kinsey. 2024.Dubo-SQL: Diverse Retrieval-Augmented Generation and Fine Tuning for Text-to-SQL.arXiv preprint arXiv:2404.12560 (2024).https://api.semanticscholar.org/CorpusID:269282528
- Touvron etal. (2023)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, etal. 2023.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288 (2023).
- Volvovsky etal. (2024)Shai Volvovsky, Marco Marcassa, and Mustafa Panbiharwala. 2024.DFIN-SQL: Integrating Focused Schema with DIN-SQL for Superior Accuracy in Large-Scale Databases.arXiv preprint arXiv:2403.00872 (2024).
- Wang etal. (2023)Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, and Zhoujun Li. 2023.Mac-sql: Multi-agent collaboration for text-to-sql.arXiv preprint arXiv:2312.11242 (2023).
- Wang etal. (2019)Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019.Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers.arXiv preprint arXiv:1911.04942 (2019).
- Wang etal. (2022)Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022.Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171 (2022).
- Wu etal. (2023)Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023.Bloomberggpt: A large language model for finance.arXiv preprint arXiv:2303.17564 (2023).
- Xia etal. (2024)Hanchen Xia, Feng Jiang, Naihao Deng, Cunxiang Wang, Guojiang Zhao, Rada Mihalcea, and Yue Zhang. 2024.SQL-CRAFT: Text-to-SQL through Interactive Refinement and Enhanced Reasoning.arXiv preprint arXiv:2402.14851 (2024).
- Xie etal. (2024)Yuanzhen Xie, Xinzhou Jin, Tao Xie, MingXiong Lin, Liang Chen, Chenyun Yu, Lei Cheng, ChengXiang Zhuo, Bo Hu, and Zang Li. 2024.Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm.arXiv preprint arXiv:2402.10671 (2024).
- Yang etal. (2023)Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, etal. 2023.Baichuan 2: Open large-scale language models.arXiv preprint arXiv:2309.10305 (2023).
- Yu etal. (2018)Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, etal. 2018.Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task.arXiv preprint arXiv:1809.08887 (2018).
- Zhang etal. (2024)Chao Zhang, Yuren Mao, Yijiang Fan, Yu Mi, Yunjun Gao, Lu Chen, Dongfang Lou, and Jinshu Lin. 2024.FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis.arXiv preprint arXiv:2401.10506 (2024).
- Zhang etal. (2023)Hanchong Zhang, Ruisheng Cao, Lu Chen, Hongshen Xu, and Kai Yu. 2023.Act-sql: In-context learning for text-to-sql with automatically-generated chain-of-thought.arXiv preprint arXiv:2310.17342 (2023).
- Zheng etal. (2023)Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei Ye, and Quanquan Gu. 2023.Structure-informed language models are protein designers. In International Conference on Machine Learning. PMLR, 42317–42338.
- Zhong etal. (2017)Victor Zhong, Caiming Xiong, and Richard Socher. 2017.Seq2sql: Generating structured queries from natural language using reinforcement learning.arXiv preprint arXiv:1709.00103 (2017).