# ZeroNL2SQL

**Repository Path**: xiaofengalg/ZeroNL2SQL-official

## Basic Information

- **Project Name**: ZeroNL2SQL
- **Description**: 结合LLM和传统nl2sql模型
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2023-12-19
- **Last Updated**: 2024-04-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ZeroNL2SQL
## :thought_balloon: Introduction
This repository contains the code for the paper [``Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation''](https://arxiv.org/abs/2306.08891).

## :open_file_folder: Data Preparation
### Train data
- [Spider](https://drive.google.com/uc?export=download&id=1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0): Put it under `src/datasets/spider`.
### Test data
- [KaggleDBQA](https://drive.google.com/file/d/1rckVlUcZ1EB7pLJBUzrfeZlEZIc8ruCr/view?usp=share_link): Put it under `src/datasets/kaggledbqa`.
- [Dr.Spider](https://drive.google.com/file/d/1VN0S5Q5NbFe8MGB21T9f6psGw93LRz00/view?usp=share_link): Put it under `src/datasets/drspider`.

```sh
mkdir data/
unzip src/datasets/kaggledbqa/kaggledbqa.zip -d data/
unzip src/datasets/drspider/drspider.zip -d data/
# Don't delete the original .zip file
```
## :computer: Environment Preparation
![PWC](https://img.shields.io/badge/Python-3.8.3-green)

Please refer to `requirements.txt` to download the relevant toolkits.

Prepare the following folders:
```sh
cd ZeroNL2SQL
mkdir logs
mkdir experimental_outputs/train/template_generator
mkdir experimental_outputs/train/aligner
```

## :zap: Quick Start

### Download models
- [Template Generator](https://drive.google.com/drive/folders/1rkHvECBv7zO58q6_kNSFvfEHeRJGZiP1?usp=sharing): Put it under `experimental_outputs/train/template_generator`.
- [Aligner](https://drive.google.com/file/d/1IvvyYo_S2muVr4HyxHyhZifFlIRlPToD/view?usp=share_link): Put it under `experimental_outputs/train/aligner`.

### Text-to-SQL inference

Use the following script to directly infer on the text-to-sql test set. This script will take four steps: 1. generate SQL template; 2. align (SELECT, STRUCTURE) with the user question; 3. prepare data for LLM inference; 4. text2sql using LLM.
```sh
CUDA_VISIBLE_DEVICES={gpu_id} bash scripts/infer_LLM_with_template.sh test_set_name your_openai_key
```

- The first argument is the name of the test set, which can be selected from `kaggledbqa`, `DB_DBcontent_equivalence`, `DB_schema_abbreviation`, `DB_schema_synonym`, `NLQ_keyword_synonym`, `NLQ_keyword_carrier`, `NLQ_column_synonym`, `NLQ_column_carrier`, `NLQ_column_attribute`, `NLQ_column_value`, `NLQ_value_synonym`, `NLQ_multitype`, `NLQ_others`, `SQL_comparison`, `SQL_sort_order`, `SQL_NonDB_number`, `SQL_DB_text`, `SQL_DB_number`. 
- The second argument is your openai key, which you can obtain from the [official website](https://platform.openai.com/account/api-keys).

Note that we evaluate the text-to-SQL results using the [test_suite_evaluation](https://github.com/taoyds/test-suite-sql-eval), and the evaluation results are presented in `eval.output`.

## :open_hands: Train From Scratch

### Train template generator

```sh
CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run.py configs/train_template_generator.json"
```
The best model will be saved at `experimental_outputs/train/template_generator/BEST_MODEL/`.

### Train aligner

```sh
CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run_aligner.py configs/train_aligner.json"
```
The best model will be saved at `experimental_outputs/train/aligner/checkpoint_best.pkl`.

## :speech_balloon:Citation

If our code is helpful to you, please cite our work:
```sh
@misc{gu2023interleaving,
      title={Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation}, 
      author={Zihui Gu and Ju Fan and Nan Tang and Songyue Zhang and Yuxin Zhang and Zui Chen and Lei Cao and Guoliang Li and Sam Madden and Xiaoyong Du},
      year={2023},
      eprint={2306.08891},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```